AD-A109 77*  STANFORD UNIV CA DEPT OF OPERATIONS RESEARCH  F/O 11/1

THE SHIFT-FUNCTION APPROACH FOR MARKOV DECISION PROCESSES WITH -- ETC(U)
JUL 81  S STIDHAM, J VAN NUNEN  N00014-76-C-0418


THE SHIFT-FUNCTION APPROACH FOR MARKOV DECISION
PROCESSES  WITH  UNBOUNDED  RETURNS 

BY 


SHALER STIDHAM, JR.* and JO VAN NUNEN**


TECHNICAL  REPORT  NO.  98 
JULY  1981 


PREPARED  UNDER  CONTRACT 
N00014-76-C-0418  (NR-047-061) 

FOR  THE  OFFICE  OF  NAVAL  RESEARCH 




Frederick  S.  Hillier,  Project  Director 


Reproduction  in  Whole  or  in  Part  is  Permitted 
for  any  Purpose  of  the  United  States  Government 

This  document  has  been  approved  for  public  release  and  sale; 
its  distribution  is  unlimited 


DEPARTMENT  OF  OPERATIONS  RESEARCH 
STANFORD  UNIVERSITY 
STANFORD,  CALIFORNIA 



This research was supported in part by National Science Foundation
Grant ECS 80-17867, Department of Operations Research, Stanford University,
and issued as Technical Report No. 60.

*The  research  of  this  author  was  partially  supported  by  National  Science 
Foundation  Grant  No.  ENG  78-74420  at  North  Carolina  State  University. 

**Graduate  School  of  Management,  Delft,  The  Netherlands.  The  research  of  this 
author was done during an appointment as Visiting Assistant Professor of
Operations  Research  and  Mathematics,  North  Carolina  State  University,  January 
to  June,  1978. 

The  first  draft  of  this  report  was  titled  "Uniform  Convergence  of  Successive 
Approximations  in  Dynamic  Programming  with  Unbounded  Returns  and  Non-Zero 
Terminal  Value  Function". 


0. Introduction


We consider a Markov decision process with a general state space and general
action space. The system is observed at discrete points in time. If it is in
state s and action a is taken, then an immediate return r(s,a) is earned and the
system makes a transition to a new state according to the (possibly defective)
transition probability measure $p(s,a;\cdot)$. The objective is to maximize the
expected total return over a finite or infinite horizon from each possible
starting state. Discounting is accommodated by incorporating the discount factor
in the transition probabilities.

We seek a set of realistic and easily verified conditions under which optimal
policies can be computed (or approximated) in an efficient way. What
distinguishes our approach from many others in the literature is that the
conditions we develop are appropriate to a specific class of Markov decision
processes: those that arise in the control of stochastic service and storage
systems, such as queueing, replacement, and inventory systems. Under these
conditions we are able to establish the standard results of the theory of Markov
decision processes:
(i) that the optimal value function satisfies the optimality equation of dynamic
programming, (ii) that it is the unique solution in a certain class of functions,
(iii) that a stationary policy attaining the maximum in the optimality equation
is optimal among all policies, and (iv) that the method of successive
approximations converges (in other words, the finite-horizon optimal value
functions approach the infinite-horizon optimal value function).

It is customary in the literature to start with a general Markov decision
model and impose regularity conditions, such as uniform, polynomial, or
exponential bounds on the return functions [3], [14], [20], [37], as they are
needed to derive desired results. By contrast, we begin with what we feel to be
appropriate abstractions of the applications in which we are interested. We
impose on our model a set of conditions, mostly involving monotonicity of return
functions and transition probabilities, that are common to many of the control
models for queueing, replacement, and inventory systems in the literature, and
then work toward achieving as many of the goals (i)-(iv) as possible. In fact,
we are able to achieve all four goals. Moreover, we are able to do this, by means
of certain transformations, in the context of an approach based on contraction
mappings with respect to the sup norm, so that the convergence of successive
approximations is uniform and geometric. This property makes it possible to use
various techniques to accelerate convergence, such as bounds, elimination of
suboptimal actions, and transformations to reduce the spectral radius of the
process (see, for example, [8], [9], [17], [18], [19], [20], [22]).

In Section 1 we introduce our basic Markov decision model and establish
notation. The special cases of this model studied in this paper are all basically
examples of the essentially negative model of Hinderer [11]. That is, the
one-stage return function is bounded above and the one-stage discount factor is
strictly smaller than one. Like Schäl [24] we allow the discount factor to depend
on the state and action, so that our model covers semi-Markov decision processes.
Our notation incorporates the discount factor in the transition probabilities and
allows for defective transition probabilities (cf. [20], [21]). For n-stage
problems we extend the Schäl model to allow a non-zero terminal-value (scrap)
function (cf. [10], [32]). In the context of the successive-approximations
method for infinite-stage problems, this is equivalent to allowing a non-zero
starting function.

In Sections 2 and 3 we study two special cases of the basic decision model
of Section 1, both of which abstract some of the properties commonly found in
stochastic service and storage systems. The model of Section 2 allows unbounded
return functions, but the structure of the return function and transition
probabilities gives rise to bounded optimal value functions. Hence goals (i)-(iv)
can be achieved and, moreover, the convergence of the method of successive
approximations is uniform with respect to the sup norm. Although the conditions
of this model may seem restrictive, we give an example from queueing control to
show that they can be satisfied in some applications.

The model of Section 3 also allows unbounded return functions, but places
fewer restrictions on the return functions and transition probabilities and hence
is applicable to a wider class of Markov decision processes. In fact the
conditions of this model seem to be satisfied by nearly all the queueing-control
models considered in the literature. We give some examples in support of this
assertion and also present an inventory model in which our conditions are
satisfied. The assumptions for this model are weaker than those of the models in
the literature in which an (s,S) policy is shown to be optimal. As in the case
of the model of Section 2, we are able to achieve goals (i)-(iv) for the
model of this section, and establish uniform convergence of successive
approximations. The optimal value functions are not bounded for this model, however.

By means of a shift transformation [7], [20], [21] we show how to convert this
model to an equivalent model satisfying the conditions of Section 2 and give an
economic interpretation of the transformation. As a shift function we propose
(among other possibilities) the infinite-horizon value function from a particular
reference policy of a simple form. In queueing-control models the reference
policy is usually an extremal policy, e.g., the policy that rejects all customers
in an arrival-control problem or the policy that always uses the maximal service
rate in a service-rate-control problem. In the inventory-control model the
reference policy could be the policy that orders nothing when inventory is
positive and orders up to zero, if possible, when the inventory is negative.




1.  Basic  Decision  Model 

We now give a detailed description of the basic Markov decision model and
establish the notation that will be used throughout the paper. Informally speaking,
the object of study is a system that can be controlled at discrete time points,
called stages and labeled t = 0, 1, 2, ... . If at stage t the system is observed
to be in state $s \in S$, the decision maker can select an action $a \in D(s)$,
the set of admissible actions. If he selects action a in state s at stage t, then
an immediate return r(s,a) is earned and the state at stage t + 1 will be in the
set B with probability p(s,a;B).

As in [20], [21] we allow for defective transition probabilities, i.e.
p(s,a;S) < 1. The model thus includes discounted Markov and semi-Markov decision
processes, as well as stopping problems. In a more formal development, the model
can be converted to one with proper transition probabilities by introducing an
absorbing state (cf. [20], [21]). As this conversion is by now standard in the
literature, we shall assume that it has been done and make no further reference
to it.
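The conversion to proper transition probabilities can be illustrated on a finite state space. The following is a minimal sketch, not taken from the report: the function name `add_absorbing_state` and the two-state example are hypothetical, and the sketch assumes that the absorbing state earns zero return so that total expected returns are unchanged.

```python
def add_absorbing_state(P):
    """Turn a defective (substochastic) transition matrix into a proper one.

    P is a list of rows; row i may sum to less than 1. The missing mass
    1 - sum(row) is routed to a new absorbing state appended as the last index.
    """
    n = len(P)
    proper = []
    for row in P:
        deficit = 1.0 - sum(row)
        if deficit < -1e-12:
            raise ValueError("row sums to more than 1")
        proper.append(list(row) + [max(deficit, 0.0)])
    # The absorbing state returns to itself with probability 1.
    proper.append([0.0] * n + [1.0])
    return proper


# Hypothetical two-state chain with discount factor 0.9 folded into P.
beta = 0.9
P_defective = [[beta * 0.5, beta * 0.5],
               [beta * 0.2, beta * 0.8]]
P_proper = add_absorbing_state(P_defective)
```

Each row of `P_proper` sums to one, and the mass 1 - 0.9 lost to discounting in every period appears as the probability of moving to the absorbing state.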

A policy $\pi$ is defined in the usual way [3], [14], [11], [24], [26] as a
collection of decision rules for choosing actions at each stage t. A policy is
called stationary, and denoted simply by f, if it always chooses the same action,
$a = f(s) \in D(s)$, whenever the system is in state $s \in S$. The set of all
stationary policies will be denoted by F. Each starting state $s \in S$ and policy
$\pi$, together with the transition probability measure p, determine a stochastic
process $\{(X_t, A_t),\ t = 0, 1, \ldots\}$ with associated probability measure
$P_s^\pi$. Here $X_t$ denotes the state and $A_t$ the action at stage t. We shall
denote by $E_s^\pi[\cdot]$ the expectation operator associated with $P_s^\pi$, and
write $E^\pi[\cdot]$ to denote the function that assigns value $E_s^\pi[\cdot]$ to
the point $s \in S$.


In order to keep the exposition simple, we shall make no reference in this
paper to measure-theoretic and topological conditions needed to ensure that the
stochastic processes and associated measures $P_s^\pi$ are well defined. Our
development is rigorous if, for example, all the sets referred to above are
standard Borel spaces and the functions are measurable. (See, e.g., Hinderer [11],
Schäl [24], Serfozo [26], or Stidham [33] for details.)


Associated with each policy $\pi$, formally define the infinite-horizon value
function $V^\pi$ by

$$ V^\pi(s) := E_s^\pi\Big[\sum_{t=0}^{\infty} r(X_t, A_t)\Big] \qquad (s \in S) $$

and define the infinite-horizon optimal value function $V^*$ by

$$ V^*(s) := \sup_{\pi} V^\pi(s) \qquad (s \in S) $$

In order to ensure that the expectations are well defined, we shall make specific
assumptions regarding r and p for each of the specific models considered in this
paper. In all cases the models will be special cases of an essentially negative
model (cf. Hinderer [11]).

For finite-stage problems we allow a terminal-value (scrap) function,
$V_0: S \to R$. That is, if the horizon length is n, then the system terminates
upon reaching stage t = n and earns a terminal value $V_0(X_n)$. Associated with
each policy $\pi$, formally define the n-stage value function $V_n^\pi$ by

$$ V_n^\pi(s) := E_s^\pi\Big[\sum_{t=0}^{n-1} r(X_t, A_t) + V_0(X_n)\Big] \qquad (s \in S) $$

and the n-stage optimal value function $V_n$ by

$$ V_n(s) := \sup_{\pi} V_n^\pi(s) \qquad (s \in S) $$

Again, we shall impose specific conditions on r, p, and $V_0$ to ensure that these
functions are well defined for each of the models considered.

For a stationary policy f define the operator P(f) by

$$ (P(f)v)(s) := \int p(s, f(s); ds')\, v(s') \qquad (s \in S) $$

for all functions v such that the integral is well defined. Similarly, define the
operator L(f) by

$$ L(f)v := r(f) + P(f)v, $$

where r(f) is the function whose value for the argument s is r(s,f(s)).

It follows from the definitions that, for a stationary policy f,

$$ V^f = L(f)V^f = r(f) + P(f)V^f, \qquad (1.1) $$

$$ (L(f))^n 0 \to V^f, \quad \text{as } n \to \infty. \qquad (1.2) $$

Define the operator U by

$$ Uv := \sup_{f \in F} L(f)v. $$

Under the conditions satisfied by the models considered in this paper, it can be
shown [32] that $V^*$ and $V_n$ satisfy the optimality equations

$$ V^* = UV^* \qquad (1.3) $$

$$ V_n = UV_{n-1} = U^n V_0, \quad n \ge 1. \qquad (1.4) $$
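For a finite state and action space the operators L(f) and U and the recursion (1.4) reduce to a few lines of code. The sketch below is illustrative only: the data layout (`r[s][a]` for returns, `p[s][a][t]` for possibly defective transition probabilities) and the two-state example are assumptions made here, not part of the report, and the supremum over stationary policies is computed as a state-by-state maximum over actions.

```python
def L(f, v, r, p):
    """Apply L(f): (L(f)v)(s) = r(s, f(s)) + sum_t p(s, f(s); t) v(t)."""
    n = len(v)
    return [r[s][f[s]] + sum(p[s][f[s]][t] * v[t] for t in range(n))
            for s in range(n)]


def U(v, r, p):
    """Apply U: (Uv)(s) = max_a { r(s, a) + sum_t p(s, a; t) v(t) }."""
    n = len(v)
    return [max(r[s][a] + sum(p[s][a][t] * v[t] for t in range(n))
                for a in range(len(r[s])))
            for s in range(n)]


def successive_approximations(v0, r, p, n):
    """Return V_n = U^n V_0, as in (1.4)."""
    v = list(v0)
    for _ in range(n):
        v = U(v, r, p)
    return v


# Hypothetical example: action 0 moves to state 0, action 1 moves to state 1,
# with a discount factor of 0.9 folded into the transition probabilities.
r = [[1.0, 0.0],
     [-2.0, 0.5]]
p = [[[0.9, 0.0], [0.0, 0.9]],
     [[0.9, 0.0], [0.0, 0.9]]]

v_200 = successive_approximations([0.0, 0.0], r, p, 200)
```

In this example the iterates approach the values 10 and 7 for states 0 and 1, and the stationary policy that always chooses action 0 satisfies the fixed-point relation (1.1) at those values.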

As in Stidham [32], our main goal in this paper will be to establish conditions
(on r, p, and $V_0$) under which the method of successive approximations
converges, that is, $V_n \to V^*$. In contrast to [32], in which pointwise
convergence was established using extensions of the methods of Strauch [34],
Hinderer [11], and Schäl [24], we shall here seek uniform convergence
(convergence in sup norm) under conditions in which the operator P(f) is
contractive, but r(f) is unbounded. As indicated in the Introduction, our
conditions are specifically tailored to fit control models for stochastic service
and storage systems.

For any function $v: S \to R$, define the supremum norm of v, denoted $\|v\|$, by

$$ \|v\| := \sup_{s \in S} |v(s)|. $$

Let $\bar R := R \cup \{-\infty\}$ and let $V: S \to \bar R$ be a given function
(hereafter called a reference function). Let W(V) be the Banach space of all
functions uniformly bounded away from V. That is,

$$ W(V) := \{v: S \to \bar R \mid \|v - V\| < \infty\}. $$

The following lemma, which follows from (1.3) and the contraction-mapping
fixed-point theorem, will be used frequently in our analysis.

Lemma 1.1. Suppose

(i) $V^* \in W(V)$;

(ii) U is contractive on W(V), i.e., there exists a $\rho$, $0 \le \rho < 1$, such that

$$ \|Uu - Uv\| \le \rho\, \|u - v\| \quad \text{for all } u, v \in W(V). $$

Then, for all $v \in W(V)$, as $n \to \infty$,

$$ \|U^n v - V^*\| \le \rho^n \|v - V^*\| \to 0 $$

and $V^*$ is the unique fixed point of U in W(V).

(It should be noted that (i) and (ii) imply that $U: W(V) \to W(V)$.)
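The geometric bound in Lemma 1.1 can be checked numerically on a small finite model. The sketch below is an illustration under assumptions made here (a hypothetical two-state, two-action model with reference function V = 0, and V* approximated by iterating U many times); it is not part of the report.

```python
def U(v, r, p):
    """(Uv)(s) = max_a { r(s, a) + sum_t p(s, a; t) v(t) } on a finite model."""
    n = len(v)
    return [max(r[s][a] + sum(p[s][a][t] * v[t] for t in range(n))
                for a in range(len(r[s])))
            for s in range(n)]


def sup_dist(u, v):
    """Sup-norm distance between two functions on the finite state space."""
    return max(abs(x - y) for x, y in zip(u, v))


# Hypothetical model; every row of p sums to at most 0.8, so rho = 0.8.
r = [[1.0, 0.0], [-2.0, 0.5]]
p = [[[0.5, 0.3], [0.1, 0.7]],
     [[0.6, 0.2], [0.0, 0.8]]]
rho = max(sum(p[s][a]) for s in range(2) for a in range(2))

# Approximate V* by iterating far beyond the horizon that is inspected below.
v_star = [0.0, 0.0]
for _ in range(2000):
    v_star = U(v_star, r, p)

# Compare ||U^n v - V*|| with rho^n ||v - V*|| for an arbitrary bounded start v.
v = [50.0, -30.0]
initial_gap = sup_dist(v, v_star)
gaps = []
for n in range(1, 21):
    v = U(v, r, p)
    gaps.append((sup_dist(v, v_star), rho ** n * initial_gap))
```

Each pair in `gaps` holds the observed distance and the bound from the lemma; the first entry should never exceed the second, up to rounding.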


2.  Model  I 


In this section we study a special case of the basic decision model presented
in the previous section. The special structure of this model makes it possible
to apply the theory of contraction mappings to deduce uniform geometric convergence
of the method of successive approximations, even though the one-period return
function may be unbounded. In fact, our simple assumptions describe an important
class of Markov decision processes for which the classical theory of Markov
decision processes, which was developed under conditions that only allowed
uniformly bounded one-stage returns, is applicable. The difference between our
conditions and the classical conditions (cf., e.g., [3], [5]) is that we do not
require r(f) to be uniformly bounded for all f, but only for certain f. Our
conditions may in fact be part of the folklore, but we could not find them in the
literature.

Although the assumptions of our model may at first seem artificial and
restrictive, they are satisfied in certain applications, as we shall demonstrate
in the latter part of this section by means of an example from the control of
queues. Of more significance, perhaps, is the fact that for a wide class of
decision models, including most of the queueing-control models in the literature,
it is possible to transform the problem into an equivalent problem that satisfies
the assumptions of this section. We shall demonstrate this transformation in the
next section.

Let $e: S \to R$ be the unit function; that is, e(s) = 1. For each stationary
policy f, define the supremum norm of the transition operator P(f) by

$$ \|P(f)\| := \sup_{s \in S} (P(f)e)(s) = \sup_{s \in S} p(s, f(s); S). $$

Also define $\bar r: S \to R \cup \{+\infty\}$ by

$$ \bar r := \sup_{f \in F} r(f). $$

We shall need the following conditions.

Condition 2.1. $\rho := \sup_{f \in F} \|P(f)\| < 1$.

Condition 2.2. $M := \|\bar r\| < \infty$.


For discounted problems Condition 2.1 is equivalent to having the discount
factor uniformly strictly less than one. Condition 2.2 implies (but is not
implied by) having the one-stage return r(f) uniformly bounded above. Thus Model
I is a special case of an essentially negative dynamic-programming model. Note
that Condition 2.2 does not require the one-stage returns to be bounded below,
so that Model I is not a special case of the classical discounted bounded-return
model of Blackwell and Denardo. However, we do require r(f) to be uniformly
bounded for the subclass of myopic stationary policies, that is, policies f for
which $r(f) = \bar r$.
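A small numerical illustration may help separate the two conditions from the classical bounded-return assumption. The return and discount functions below are hypothetical, chosen only so that one action has returns that are unbounded below while the myopic return (the pointwise maximum over actions) stays bounded; the state space is truncated at 10000 purely for display.

```python
def r(s, a):
    """Hypothetical returns: action 0 earns a bounded reward, action 1 a cost growing in s."""
    return 1.0 / (1 + s) if a == 0 else -float(s)


def beta(s, a):
    """Hypothetical state- and action-dependent discount factor p(s, a; S)."""
    return 0.9 if a == 0 else 0.5


ACTIONS = (0, 1)
STATES = range(0, 10001)  # finite window on the countable state space {0, 1, 2, ...}


def r_bar(s):
    """Myopic return: the pointwise supremum of r(s, a) over actions."""
    return max(r(s, a) for a in ACTIONS)


rho = max(beta(s, a) for s in STATES for a in ACTIONS)   # Condition 2.1: rho < 1
M = max(abs(r_bar(s)) for s in STATES)                   # Condition 2.2: M finite
worst = min(r(s, a) for s in STATES for a in ACTIONS)    # grows without bound in s
```

Here `M` stays at 1 however far the window is extended, while `worst` decreases linearly with the size of the window, so the model satisfies Condition 2.2 without having returns that are bounded below.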

In this section we shall be interested in the Banach space W := W(0) of all
uniformly bounded functions, that is,

$$ W := \{v: S \to R \mid \|v\| < \infty\}. $$

The key results of this section are contained in the following theorem.

Theorem 2.1. Assume Conditions 2.1 and 2.2. Then $V^* \in W$ is the unique bounded
solution to the optimality equation

$$ v = Uv $$

and, for any $V_0 \in W$,

$$ \|V^* - V_n\| \le \rho^n \|V^* - V_0\| \le \rho^n \big[(1-\rho)^{-1} M + \|V_0\|\big] < \infty, $$

so that $V_n = U^n V_0 \to V^*$ uniformly and geometrically (with respect to the
sup norm).
n  o 

Proof. These results follow from Lemma 3.3 and Theorem 3.1 in van Nunen and
Wessels [21], extended in an obvious way to a general (not necessarily countable)
state space, since Conditions 2.1 and 2.2 imply assumption 2.3 of [21].

Alternatively, the theorem may be proved directly by verifying that (i) and (ii)
of Lemma 1.1 hold, so that the classical contraction-mapping theory applies.
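The direct verification can be outlined as follows. This is a sketch under the assumptions of Theorem 2.1, written for this edition rather than reproduced from the report; it covers the self-mapping and contraction properties of U on W and the resulting norm bound on a bounded fixed point, while the identification of that fixed point with $V^*$ (part (i) of Lemma 1.1) is taken from the cited theory.

```latex
\begin{align*}
\text{For } v \in W,\ f \in F: \quad
  & r(f) - \rho\,\|v\| \;\le\; r(f) + P(f)v \;\le\; r(f) + \rho\,\|v\| , \\
\text{taking } \sup_{f \in F}: \quad
  & \bar r - \rho\,\|v\| \;\le\; Uv \;\le\; \bar r + \rho\,\|v\| ,
    \qquad\text{so}\qquad \|Uv\| \le M + \rho\,\|v\| < \infty , \\
\text{for } u, v \in W: \quad
  & Uu \;=\; \sup_{f \in F}\bigl\{ L(f)v + P(f)(u - v) \bigr\}
    \;\le\; Uv + \rho\,\|u - v\| , \\
\text{and by symmetry:} \quad
  & \|Uu - Uv\| \;\le\; \rho\,\|u - v\| . \\
\text{If } v = Uv \in W: \quad
  & \|v\| \le M + \rho\,\|v\|
    \quad\Longrightarrow\quad \|v\| \le (1-\rho)^{-1} M .
\end{align*}
```

The last line, combined with the triangle inequality $\|V^* - V_0\| \le \|V^*\| + \|V_0\|$, gives the second inequality in the statement of the theorem.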

Conditions 2.1 and 2.2 give a simple special case in which the very general
conditions of van Nunen and Wessels [21] hold, and hence the theory of contraction
mappings based on weighted sup norms can be applied. Indeed, from the fact
(demonstrated in Theorem 2.1) that only "good" (i.e. myopic) policies need to
have bounded return functions in order for the model to be contractive, it follows
that the classical theory [3], [5] based on ordinary sup norms is applicable.

The advantage of our conditions over the more general conditions in [21] is
that they are simple to state and can be easily checked in applications. They
would not be of much use, however, if they were too restrictive to have any
significant application (except trivially in the case where $r(\cdot,\cdot)$
itself is bounded). In the remainder of this section, we shall illustrate the
application of Conditions 2.1 and 2.2 to problems with special structure, which
makes verification of these conditions easy. To this end we shall present new
conditions that imply Conditions 2.1 and 2.2, in the context of a general model
with partially ordered state and/or action spaces. These conditions, which
involve monotonicity of r and p, are much stronger than necessary for Conditions
2.1 and 2.2, and consequently lead to sharper results than Theorem 2.1. They are
related to but weaker than conditions that have been proposed in the literature
(cf., e.g., Stidham and Prabhu [33], Serfozo [26]) for a different purpose: namely,
showing that an optimal policy has a particular (e.g., monotonic or control-limit)
form. We illustrate the application of these stronger conditions with an example
from control of queues. In the next section we show how more general problems
can often be transformed into equivalent problems having this structure, so that
the results of this section can be applied.

For the remainder of this section, suppose that the state space S is
partially ordered by a relation $\le$ and that D(s) = A, for all $s \in S$. A
function $v: S \to R$ is called increasing (decreasing) if $v(s) \le v(t)$
($v(s) \ge v(t)$) for all $s \le t$ in S. A set $B \subset S$ is called increasing
(decreasing) if $s \in B$ implies $t \in B$ whenever $s \le t$ ($s \ge t$). We
shall need the following conditions.

Condition 2.3. There is an element $0 \in S$ that is minimal with respect to the
relation $\le$. That is, $s \ge 0$ for all $s \in S$.

Condition 2.4. For all $a \in A$, r(s,a) is decreasing in $s \in S$; moreover
$\sup_{a \in A} r(0,a) =: M < \infty$.

Condition 2.5. For each $s \in S$, there exists an $a \in A$ such that $r(s,a) \ge 0$.

Conditions 2.3 and 2.4 imply that $r(s,a) \le M < \infty$, for all $s \in S$,
$a \in A$. Condition 2.5 implies that $\bar r(s) \ge 0$, for all $s \in S$. Hence
$\|\bar r\| \le M < \infty$ and Condition 2.2 applies. Moreover,
$0 \le V^* \le M(1-\rho)^{-1}$. Hence we have the following corollary of Theorem 2.1.

Corollary 2.2. Assume Conditions 2.1 and 2.3, 2.4, 2.5. Then
$0 \le V^* \le M(1-\rho)^{-1} < \infty$, $V^*$ is the unique solution in W to the
optimality equation

$$ v = Uv $$

and, for any $V_0 \in W$, $V_n = U^n V_0 \to V^*$ uniformly and geometrically with
respect to the sup norm.

The monotonicity of r implied by Condition 2.4 makes it natural to look for
conditions under which $V_n$ and $V^*$ will be monotonic functions on S.
(Monotonicity of $V_n$ and $V^*$ is often used in proving that an optimal policy
has a particular form, for Markov decision processes with special structure. See,
for example, Sobel [2$J, Stidham and Prabhu [33], and Serfozo [26].) For this
purpose we shall need the following additional condition.

Condition 2.6. For all $s \in S$, $s' \in S$, $s \le s'$ implies
$p(s,a;B) \le p(s',a;B)$ for all $a \in A$ and each increasing set $B \subset S$.

When Condition 2.6 holds we say that p is stochastically increasing in s
(cf. Lehmann [13], Bessler and Veinott [2], Serfozo [26]; Serfozo calls such p a
monotone transition probability). From Serfozo [26] we get the following.

Lemma 2.3. Condition 2.6 holds if and only if, for every stationary policy f such
that $f(s) \equiv a$, for all $s \in S$, and every increasing (decreasing)
function $v: S \to R$, P(f)v is increasing (decreasing).
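On a finite, totally ordered state space Condition 2.6 and the conclusion of Lemma 2.3 are easy to test directly. The following sketch is illustrative: the helper names and the two matrices are hypothetical, the states are 0 < 1 < ... < n-1, and the example keeps the total mass of every row equal (a constant discount factor of 0.9), which is the simplest setting in which the defect plays no role.

```python
def is_stochastically_increasing(P, tol=1e-12):
    """Check Condition 2.6 for one fixed action on the ordered states 0..n-1.

    On a chain the nonempty increasing sets are the upper tails {k, ..., n-1},
    so it suffices that every tail mass is non-decreasing in the current state.
    """
    n = len(P)
    for k in range(n):
        tails = [sum(row[k:]) for row in P]
        if any(tails[i] > tails[i + 1] + tol for i in range(n - 1)):
            return False
    return True


def apply_P(P, v):
    """Compute (P v)(s) = sum_t P[s][t] v(t)."""
    return [sum(pst * vt for pst, vt in zip(row, v)) for row in P]


def is_increasing(v, tol=1e-12):
    """True if v(0) <= v(1) <= ... up to rounding."""
    return all(v[i] <= v[i + 1] + tol for i in range(len(v) - 1))


# Hypothetical random-walk-like kernel with discount factor 0.9 folded in.
P_good = [[0.63, 0.27, 0.00],
          [0.36, 0.27, 0.27],
          [0.00, 0.36, 0.54]]
# Hypothetical kernel that swaps the two states: not stochastically increasing.
P_bad = [[0.0, 0.9],
         [0.9, 0.0]]

image_of_increasing = apply_P(P_good, [0.0, 1.0, 4.0])
image_of_decreasing = apply_P(P_good, [5.0, 2.0, -1.0])
```

For `P_good` the image of the increasing function is again increasing and the image of the decreasing function is again decreasing, in line with the lemma, while `P_bad` fails the tail-mass test.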

Now the following theorem can be proved easily by induction on n.

Theorem 2.4. Assume Conditions 2.1, and 2.3-2.6. If $V_0 \in W$ and is decreasing,
then $V_n \in W$ and is decreasing, for each $n \ge 1$, $V_n \to V^*$ uniformly and
geometrically, and $V^*$ is decreasing and the unique solution in W to v = Uv.

Remark 2.1. Monotone return functions and transition probabilities are
encountered often in applications of Markov decision processes, particularly to
queueing and replacement systems. Such systems are often closely related to
random walks and thus have a (nearly) additive transition structure. Specifically,
the state represents the "quantity" (e.g. number of customers) in the system
and transitions occur by means of inputs and outputs, which are (nearly)
independent of the current state and either or both of which may be subject to
control (see Stidham and Prabhu [33], Section 4.1). Such a transition structure
typically gives rise to monotone transition probabilities. (For a simple example
in the context of a controlled random walk, see Serfozo [26], Section 5.) The basic
idea, of course, is that whatever the (fixed) action, the more quantity in the
system now, the more is likely to be in the system at the next stage. The fact
that the return function, $r(\cdot,a)$, is decreasing comes about because there is
usually a cost (e.g., inventory holding cost, customer waiting cost) associated
with having quantity in the system: the more the quantity, the higher the cost.

Remark 2.2. The observation that $V^* \ge 0$ (see Corollary 2.2) depended
on the fact that Condition 2.5 implies the existence of a stationary policy that
never takes an action leading to a negative immediate return. At first glance
it might seem that a stronger statement could be made, namely, that without loss
of optimality one may restrict attention to policies that never take an action a
from any state s such that r(s,a) < 0. If this were true, then the problem would
be equivalent to one in which $r(s,a) \ge 0$, for all $s \in S$, $a \in A$. It is
easy to construct a counterexample, however, to show that Condition 2.5 does not
imply that actions leading to negative immediate returns can be ignored. The
problem, of course, is that it may be advantageous to incur a negative return now
in order to get into a set of states with large positive returns.
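One such counterexample can be written down with two states. The construction below is an illustration devised for this edition, not the authors' example: states 0 < 1, action 0 ("stay") and action 1 ("move to state 0"), and a discount factor of 0.9 folded into the transitions. It satisfies Conditions 2.1 and 2.3-2.6, yet the optimal action in state 1 has a negative immediate return.

```python
BETA = 0.9  # discount factor, folded into the (deterministic) transitions

# r[s][a] and next_state[s][a] for states 0 <= 1; action 0 stays, action 1 moves to 0.
r = [[10.0, 10.0],
     [0.0, -1.0]]
next_state = [[0, 0],
              [1, 0]]


def value_iteration(allowed, n_iter=500):
    """Successive approximations from V_0 = 0 using only the allowed actions."""
    v = [0.0, 0.0]
    for _ in range(n_iter):
        v = [max(r[s][a] + BETA * v[next_state[s][a]] for a in allowed[s])
             for s in (0, 1)]
    return v


all_actions = [[0, 1], [0, 1]]
# Drop every action with a negative immediate return (Condition 2.5 keeps this non-empty).
nonneg_only = [[a for a in (0, 1) if r[s][a] >= 0] for s in (0, 1)]

v_full = value_iteration(all_actions)        # approaches [100, 89]
v_restricted = value_iteration(nonneg_only)  # approaches [100, 0]
```

Paying the one-time cost of 1 in state 1 gives access to state 0, whose value is 10/(1 - 0.9) = 100, so the unrestricted value in state 1 is -1 + 0.9 * 100 = 89, whereas ignoring the negative-return action leaves the system in state 1 with value 0.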


There are, however, realistic additional conditions under which our model is
equivalent to one with a non-negative return function. As an example we offer

Condition 2.7. The action space A is partially ordered by a relation $\le$. For
all $s \in S$, $a \in A$ such that r(s,a) < 0, there exists an $a' \in A$, a' < a,
such that $r(s,a') \ge 0$. For each $s \in S$, $p(s,a;\cdot)$ is stochastically
increasing in $a \in A$.

(For an application in which this assumption holds, see below.)


Theorem 2.5. Assume Conditions 2.1, 2.3-2.7. Suppose $V_0 \in W$ is decreasing.
Then, for each $n \ge 1$, $V_n \in W$, is decreasing, and satisfies the restricted
optimality equation

$$ V_n = \sup_{f \in F^+} L(f) V_{n-1}, \qquad (2.1) $$

where $F^+ := \{f \in F \mid r(f) \ge 0\}$.

Moreover, $V_n \to V^*$ uniformly and geometrically, and $V^*$ is decreasing and the
unique solution in W to the restricted optimality equation

$$ V = \sup_{f \in F^+} L(f) V. \qquad (2.2) $$

If $V_0 \equiv 0$, then the convergence of $V_n$ to $V^*$ is monotonic:
$V_n \ge V_{n-1}$, $n \ge 1$.

Proof. In light of Theorem 2.4, it suffices to show that the supremum over
all stationary policies f in the original statement of the optimality equations
can be replaced by a supremum over $F^+$ without loss of optimality. We have, for
example,

$$ V^* = UV^* = \sup_{f \in F} \{ r(f) + P(f)V^* \}. $$

Let f be an arbitrary stationary policy. It follows from Condition 2.7 that
there exists a policy $f' \in F^+$ with $f' \le f$ and $r(f') \ge r(f)$. (Set
f'(s) = f(s) if $r(s,f(s)) \ge 0$; for $s \in S$ such that f(s) = a and r(s,a) < 0,
set f'(s) = a', where a' < a and $r(s,a') \ge 0 > r(s,a)$.) Since $V^*$ is
decreasing (by Theorem 2.4) it follows from the second part of Condition 2.7 that
$P(f)V^* \le P(f')V^*$. (The situation parallels that covered by Lemma 2.3, except
that now we are dealing with a stochastic ordering on A rather than S.) Therefore,

$$ r(f') + P(f')V^* \ge r(f) + P(f)V^* $$

and consequently, since f was arbitrary and $f' \in F^+$,

$$ \sup_{f \in F^+} L(f)V^* \ge \sup_{f \in F} L(f)V^* \ge \sup_{f \in F^+} L(f)V^*, $$

so that (2.2) holds. The proof of (2.1) is formally identical. Monotonicity of
convergence when $V_0 \equiv 0$ follows by induction on n, using the fact that U
is a monotone operator and $UV_0 = \bar r \ge 0 = V_0$.
A queueing-control application

Consider the following model for control of arrivals to a generalized
queueing system, which includes several arrival-control models in the literature
as special cases (see Johansen and Stidham [12] for details). Customers arrive
at intervals which are independent and identically distributed as a random variable
T, with $\Pr\{T > \epsilon\} > 0$ for some $\epsilon > 0$. Each customer brings a
certain amount of potential input to the system. The potential inputs of
successive customers are independent and identically distributed as a random
variable S. At each arrival instant the system controller has the option of
accepting or rejecting the entire potential input of the arriving customer. If an
input is accepted then it is added to the quantity, s, in the system and a net
benefit, b(s), is earned. We assume that b(s) = r - C(s), where r is a reward, or
utility of service, and C(s) is a waiting cost, a non-decreasing function of s. If
an input is rejected, the net benefit is 0. Potential output from the system is
governed by an uncontrollable stochastic process with non-negative stationary
independent increments, $\{N(t),\ t \ge 0\}$. Thus, at the end of a time interval
of length t, which begins with a quantity s in the system and during which no
arrivals are accepted, the quantity in the system is distributed as $(s - N(t))^+$.
Future benefits are continuously discounted at rate $\alpha > 0$.

In  reference  [12]  the  reward  is  allowed  to  be  a  random  variable,  but  in  all 
other  respects  the  model  is  the  same.  The  model  specializes  to  a  Cl/r,/l  system 
with  quantity  interpreted  as  work  in  the  svstem,  if  S  has  the  distribution  of 
the  service  time  and  tl(-r)  =  t  (see  [33],  [  6]).  If  S  5  1  and  { IT (t ) ,  t  >  0}  is 
a  Poisson  process,  then  the  model  specializes  to  a  CI/’.'/l  system  with  quantity 
interpreted  as  the  number  of  customers  in  the  system  (see  lift],  [31]). 

The problem of maximizing the expected discounted total net benefit over a
finite or infinite horizon can be formulated as a special case of our Markovian
decision model, in which the state s is the quantity found in the system by an
arrival, the action a = 1 (0) denotes acceptance (rejection) of a customer,
$r(s,a) = a(r - C(s))$, and $p(s,a;B) = E[e^{-\alpha T} 1((s + aS - N(T))^+ \in B)]$, where
$\alpha$ is the continuous-time discount rate and 1(E) is the indicator of the event E.
The finite-horizon and infinite-horizon optimality equations are, respectively,

$$V_n(s) = \max_{a=0,1} \{ a(r - C(s)) + E[e^{-\alpha T} V_{n-1}((s + aS - N(T))^+)] \},$$

$n \ge 1$ (assume $V_0 \equiv 0$), and

$$V^*(s) = \max_{a=0,1} \{ a(r - C(s)) + E[e^{-\alpha T} V^*((s + aS - N(T))^+)] \}, \qquad (2.3)$$

$s \in S = [0, \infty)$. It is easily verified that Assumptions 2.1-2.7 are satisfied.
Hence Theorem 2.5 applies, which implies in particular that it is optimal at
each stage n, $1 \le n \le \infty$, to reject the arriving customer (a = 0) if $r < C(s)$,
that is, if his individual net benefit is negative. The converse is not generally
true (except for n = 1): an optimal policy may reject a customer even though
$r \ge C(s)$, that is, even though it is in his individual interest to join. (For
further discussion of this phenomenon, see Johansen and Stidham [12] and the
references cited therein.) Theorem 2.5 also implies that $V_n$ and $V^*$ are
non-negative and non-increasing and that the convergence of $V_n$ to $V^*$ is monotonic
($V_n \ge V_{n-1}$) as well as uniform and geometric.
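
The optimality equations above can be explored numerically. The following is a
minimal sketch, not the authors' procedure, for one special case: $S \equiv 1$, output
governed by a Poisson process with rate `mu`, and exponential interarrival times
with rate `lam`. The state space is truncated at `K` customers, with `K` assumed
large enough that $r < C(K)$, so that forbidding acceptance in state `K` does not
change the values. The function names, the linear waiting cost, and all parameter
values are illustrative assumptions.

```python
import numpy as np

def discounted_kernel(K, lam, mu, alpha):
    """P[k, j] = E[exp(-alpha*T) 1(next arrival finds j) | k present after the decision]."""
    q = mu / (lam + mu + alpha)    # discounted weight of a departure occurring first
    p = lam / (lam + mu + alpha)   # discounted weight of the arrival occurring first
    P = np.zeros((K + 1, K + 1))
    for k in range(K + 1):
        for d in range(k):         # d < k departures, then the arrival
            P[k, k - d] = q**d * p
        P[k, 0] = q**k * lam / (lam + alpha)   # system empties before the arrival
    return P

def one_step(P, V, r, cost):
    """Returns of accepting and rejecting in each state, given continuation values V."""
    PV = P @ V
    reject = PV                            # a = 0: quantity stays at s
    accept = np.full(len(V), -np.inf)      # acceptance is not allowed in state K
    accept[:-1] = r - cost + PV[1:]        # a = 1: quantity becomes s + 1
    return accept, reject

def successive_approximations(K, r, C, lam, mu, alpha, n_iter):
    """Iterate V_n = U V_{n-1} starting from V_0 = 0."""
    P = discounted_kernel(K, lam, mu, alpha)
    cost = np.array([C(s) for s in range(K)], dtype=float)
    V = np.zeros(K + 1)
    history = [V.copy()]
    for _ in range(n_iter):
        accept, reject = one_step(P, V, r, cost)
        V = np.maximum(accept, reject)
        history.append(V.copy())
    accept, reject = one_step(P, V, r, cost)
    return V, accept >= reject, history    # policy[s] is True where accepting is greedy
```

With, say, `C = lambda s: s + 1.0` and `r = 5.0`, the returned policy should reject
in every state with $r < C(s)$, and successive iterates in `history` should be
non-decreasing, in line with the properties stated above.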
3. Model II


In  this  section  we  consider  another  special  case  of  the  general  decision 
model  of  Section  1.  Unlike  the  model  of  the  previous  section,  this  model  does 
not  have  bounded  optimal  value  functions.  It  has  enough  structure,  however,  that 
it  is  still  possible  to  demonstrate  uniform  geometric  convergence  of  the  method 
of  successive  approximations  for  certain  starting  functions.  The  structure  of  the 
model includes that found in a large number of Markov decision processes, including
most of those studied in the literature on stochastic service systems in which
monotonic, or critical-number, policies are optimal. This structure makes it
possible  to  transform  the  model,  by  means  of  a  shift  function,  into  an  equivalent 
model  satisfying  the  conditions  of  Section  2.  We  illustrate  the  applicability 
of  the  model  by  some  examples  from  queueing  and  inventory  control. 

We  shall  use  the  following  two  conditions  throughout  this  section. 

Condition 3.1. $\rho := \sup_{f \in F} \|P(f)\| < 1$.

Condition 3.2. $M := \sup_{f \in F} \|r^+(f)\| < \infty$.

Note that Conditions 3.1 and 3.2 are exactly the conditions of the essentially
negative case.

In this section we shall be interested in the Banach space W(V), where
$V: S \to R$ is a given non-zero reference function. We shall use the following
condition on V.

Condition 3.3. $\|UV - V\| < \infty$.

An upper bound on V* is a natural candidate for a reference function, since
in many applications (see examples later in this section) it is easy to compute
such an upper bound. Since Conditions 3.1 and 3.2 imply that $V^* \le M_1 := M(1-\rho)^{-1}$,
it also makes sense to confine our attention to reference functions $V \ge V^*$ such
that $V \le M_1$.

Theorem 3.1. Let V be a reference function with $V^* \le V \le M_1$. Assume
Conditions 3.1 - 3.3. Then $V^* \in W(V)$ and is the unique solution in W(V) to the
optimality equation

$$v = Uv$$

and, for any $V_0 \in W(V)$, $\|V^* - U^nV_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^nV_0$
converges to V* uniformly and geometrically.

Proof. Condition 3.1 implies (ii) of Lemma 1.1. It remains to verify (i) of
Lemma 1.1: $V^* \in W(V)$.

To this end, we first claim that it suffices to show pointwise convergence
of successive approximations with V as starting function:

$$U^nV \to V^*. \qquad (3.1)$$

To see this, note that (ii) of Lemma 1.1 and Condition 3.3 imply that ($s \in S$)

$$U^nV(s) - V(s) \le \sum_{k=0}^{n-1} \|U^{k+1}V - U^kV\| \le \sum_{k=0}^{n-1} \rho^k \|UV - V\| \le (1-\rho)^{-1} \|UV - V\| < \infty.$$

Letting $n \to \infty$ and assuming (3.1), we conclude that ($s \in S$)

$$V^*(s) - V(s) \le (1-\rho)^{-1} \|UV - V\| < \infty.$$

Reversing the roles of $U^nV(s)$ and $V(s)$ yields the same uniform upper bound for
$V(s) - V^*(s)$, so that

$$\|V^* - V\| \le (1-\rho)^{-1} \|UV - V\| < \infty,$$

that is, $V^* \in W(V)$.

Thus it remains to prove (3.1). Define $V_\infty := \lim_{n \to \infty} U^nV$. Since $V \ge V^*$,
$U^nV \ge U^nV^* = V^*$, so that $V_\infty \ge V^*$. Hence it suffices to show that
$\lim_{n \to \infty} U^nV \le V^*$.

To this end, we first show that Conditions 3.1 - 3.3 imply that
$V_\infty = \lim_{n \to \infty} U^nV$ and that $V_\infty$ is a fixed point of U. Although a direct proof is
possible, we shall instead prove these facts by means of a shift transformation,
which converts the problem into an equivalent problem satisfying the conditions
of Model I. This shift transformation is of independent interest. In
particular, it has an interesting economic interpretation, which we shall discuss
later in this section.

For each decision rule f, define $\tilde r(f) := r(f) + P(f)V - V$.

Consider the Markov decision model $(S,A,D,\tilde r,p)$. Condition 3.3 implies that

$$\Big\| \sup_{f \in F} \tilde r(f) \Big\| = \Big\| \sup_{f \in F} \{ r(f) + P(f)V \} - V \Big\| = \|UV - V\| < \infty,$$

so that Condition 2.2 holds. Condition 3.1 is identical to Condition 2.1. Define
the operator $\tilde U$, where it is well defined, by

$$\tilde Uv := \sup_{f \in F} \{ \tilde r(f) + P(f)v \}.$$

Define the infinite-horizon optimal value function $\tilde V^*$ for the transformed model in
the obvious way (cf. Section 1). Then

$$\tilde V^* = \tilde U \tilde V^*.$$

Moreover, since the transformed model satisfies Conditions 2.1 and 2.2, it follows
from Theorem 2.1 that $\tilde V^*$ is the unique bounded fixed point of $\tilde U$ and that

$$\tilde V^* = \lim_{n \to \infty} \tilde U^n 0. \qquad (3.2)$$

Now observe that

$$\tilde U 0 = \sup_{f \in F} \tilde r(f) = \sup_{f \in F} \{ r(f) + P(f)V \} - V = UV - V,$$

and by induction on n,

$$\tilde U^n 0 = U^nV - V, \quad n \ge 1. \qquad (3.3)$$

From (3.2) and (3.3) we conclude that

$$V_\infty = \lim_{n \to \infty} U^nV = \tilde V^* + V.$$

Hence

$$UV_\infty = U(\tilde V^* + V)$$
$$= \sup_{f \in F} \{ r(f) + P(f)(\tilde V^* + V) \}$$
$$= \sup_{f \in F} \{ r(f) + P(f)V - V + P(f)\tilde V^* \} + V$$
$$= \sup_{f \in F} \{ \tilde r(f) + P(f)\tilde V^* \} + V$$
$$= \tilde U \tilde V^* + V$$
$$= \tilde V^* + V$$
$$= V_\infty,$$

that is, $V_\infty$ is a fixed point of U, the desired result.
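
The identity (3.3) between iterates of the original and shifted operators is easy
to check numerically on a small finite model. The sketch below is an illustration
only: the randomly generated model, the function names, and the choice of reference
function are all assumptions, with each row of every transition matrix scaled to
sum to `rho < 1` so that Condition 3.1 holds.

```python
import numpy as np

def make_mdp(n_states, n_actions, rho, seed=0):
    """Random finite model: r[a, s] and P[a, s, s'], each row of P summing to rho."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_actions, n_states, n_states))
    P *= rho / P.sum(axis=2, keepdims=True)
    r = rng.normal(size=(n_actions, n_states))
    return r, P

def U(v, r, P):
    """(Uv)(s) = max_a { r(s,a) + sum_s' p(s,a;s') v(s') }."""
    return (r + P @ v).max(axis=0)

def shifted_rewards(r, P, V):
    """Shifted return: r(s,a) + sum_s' p(s,a;s') V(s') - V(s)."""
    return r + P @ V - V

def max_shift_discrepancy(r, P, V, n_steps):
    """Largest deviation from (3.3) over the first n_steps iterations."""
    r_shift = shifted_rewards(r, P, V)
    v, v_shift = V.copy(), np.zeros_like(V)
    worst = 0.0
    for _ in range(n_steps):
        v, v_shift = U(v, r, P), U(v_shift, r_shift, P)
        worst = max(worst, np.max(np.abs(v_shift - (v - V))))
    return worst
```

For any reference vector `V`, `max_shift_discrepancy` should return a value at the
level of floating-point rounding, since the shift only relabels the value functions.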


Let $\epsilon > 0$ be given. Choose a stationary policy f such that

$$V_\infty = UV_\infty \le L(f)V_\infty + \epsilon. \qquad (3.4)$$

(See Remark 3.1 below.) Iterating on L(f) and using Condition 3.1, we have

$$V_\infty \le (L(f))^nV_\infty + \sum_{k=0}^{n-1} \rho^k \epsilon = (L(f))^n 0 + (P(f))^nV_\infty + \sum_{k=0}^{n-1} \rho^k \epsilon,$$

so that

$$V_\infty \le V(f) + \lim_{n \to \infty} (P(f))^nV_\infty + \epsilon(1-\rho)^{-1} \le V^* + \lim_{n \to \infty} (P(f))^nV_\infty + \epsilon(1-\rho)^{-1}.$$

From Conditions 3.1 and 3.2 and the hypothesis that $V \le M_1$ it follows that
$V_\infty = \lim_{n \to \infty} U^nV \le M_1$, so that

$$\lim_{n \to \infty} (P(f))^nV_\infty \le \lim_{n \to \infty} \rho^n M_1 = 0.$$

Hence, since $\epsilon$ was arbitrary, we conclude that $V_\infty \le V^*$, the desired result.


Remark 3.1. The existence of a policy f satisfying (3.4) follows from the
existence in general of $\epsilon$-optimizing decision rules, which is a mild
regularity condition, apparently satisfied in all practical problems. It holds,
for example, when the state space is countable. For general state space, where
it is customary to require decision rules to be measurable functions in order for
the relevant stochastic processes and integrals to be well defined, there may
be difficulties in applying the condition to certain functions v, such as $v = V_\infty$,
which may not themselves be measurable (cf. Blackwell [3], Strauch [34], Hinderer
[11]). There is no problem, for example, if the action space is countable or if
continuity-compactness conditions are satisfied by (S,A,D,r,p) (see, e.g.,
Schäl [24]). Alternatively, one may enlarge the class of admissible decision
rules to include all universally measurable functions $f: S \to A$ such that
$f(s) \in D(s)$, $s \in S$, and require that r, p, and v also be universally measurable
(cf. Shreve and Bertsekas [27]).

Remark 3.2. It can easily be shown that Condition 3.3 and (3.1) together are
necessary as well as sufficient for (i) of Lemma 1.1. In other words:

$$V^* \in W(V) \text{ iff } \|UV - V\| < \infty \text{ and } U^nV \to V^*.$$

We  now  give  several  examples  of  applications  of  Model  II  to  problems  in 
control  of  queues,  followed  by  some  suggestions  for  a  general  approach  to  the 
solution  of  such  problems.  Finally,  we  show  how  Model  II  can  also  be  applied  to 
certain periodic-review inventory models.

Example 1°. Control of Arrivals.

Again we use the model of Johansen and Stidham [12] for control of arrivals
to a stochastic input-output system as a vehicle for illustrating the
verification and application of our conditions. (See Section 2 for a detailed
introduction to the model.) In the present context we are interested in the
case of continuous, rather than lump-sum, charging of the holding cost. To be
specific, the system is observed at arrival points, the state s is the quantity
in the system found by an arrival, the action a indicates acceptance (a = 1) or
rejection (a = 0) of the potential input of the arrival, the return function
r(s,a) is given by

$$r(s,a) = ar - E\Big[\int_0^T e^{-\alpha\tau} h((s + aS - N(\tau))^+)\, d\tau\Big],$$

and the transition probability by

$$p(s,a;B) = E[e^{-\alpha T} 1((s + aS - N(T))^+ \in B)].$$

Here $h(\cdot)$ gives the rate (per unit time) at which holding cost is incurred, as a
function of the quantity in the system. We assume that $h(\cdot)$ is a non-negative,
non-decreasing, convex mapping from S into R.

Thus the model differs from the one considered in Section 2 only with respect
to the return function. In the model of Section 2, there is a waiting cost, C(s),
associated with an arriving customer who joins when the state is s. This cost
might represent the expected discounted cost that the customer will incur during
the entire time he spends waiting in the system. Note that it is a cost
associated only with the joining customer, but that it reflects time spent
waiting after as well as until the next arrival point. By contrast, in the
present model the return function includes costs associated with all customers
in the system, but only until the next arrival. The relation between the two
charging schemes is given by

$$C(s) = E\Big[\int_0^\infty e^{-\alpha\tau} [h((s + S - N(\tau))^+) - h((s - N(\tau))^+)]\, d\tau\Big]. \qquad (3.5)$$

It follows from the assumption that $h(\cdot)$ is non-decreasing and convex that $C(\cdot)$
defined by (3.5) is non-negative and non-decreasing, as required in the model
of Section 2.

(For further discussion and economic interpretation of the two charging schemes,
see Johansen and Stidham [12].)

The infinite-horizon optimality equation, $V^* = UV^*$, for the problem with
continuous charging is given explicitly by ($s \ge 0$)

$$V^*(s) = \max_{a=0,1} \Big\{ ar - E\Big[\int_0^T e^{-\alpha\tau} h((s + aS - N(\tau))^+)\, d\tau\Big] + E[e^{-\alpha T} V^*((s + aS - N(T))^+)] \Big\}. \qquad (3.6)$$

It is easily verified that Conditions 3.1 and 3.2 hold. In fact, $V^* \le M_1 =
r(1-\rho)^{-1} < \infty$, where $\rho = E[e^{-\alpha T}] < 1$.

Define $V: S \to R$ by

$$V(s) := r(1-\rho)^{-1} - E\Big[\int_0^\infty e^{-\alpha\tau} h((s - N(\tau))^+)\, d\tau\Big] \quad (s \in S). \qquad (3.7)$$

Theorem 3.2. Consider the arrival-control model with continuous charging of
holding cost and with V defined by (3.7). Then $V^* \le V \le M_1 < \infty$, $V^* \in W(V)$ and
is the unique solution in W(V) to the optimality equation (3.6). Moreover, for
any $V_0 \in W(V)$, $\|V^* - U^nV_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^nV_0$ converges to
V* uniformly and geometrically.

Proof. Let $\pi$ be an arbitrary policy, and note that

$$V^\pi(s) = E_s^\pi\Big[\sum_{t=0}^\infty r(X_t, A_t)\Big] \qquad (3.8)$$
$$= E_s^\pi\Big[\sum_{t=0}^\infty rA_t\Big] - E_s^\pi\Big[\sum_{t=0}^\infty E\Big[\int_0^T e^{-\alpha\tau} h((X_t + A_tS - N(\tau))^+)\, d\tau \,\Big|\, X_t, A_t\Big]\Big].$$

The first term on the right-hand side of (3.8) is maximized by the (stationary)
policy that accepts all customers, whereas the second term is maximized by the
policy that minimizes holding costs, namely, the (stationary) policy $f_r$ that
rejects all customers ($f_r(s) = 0$, for all $s \in S$). Note that

$$V^{f_r}(s) = -E\Big[\int_0^\infty e^{-\alpha\tau} h((s - N(\tau))^+)\, d\tau\Big] \quad (s \in S) \qquad (3.9)$$

since under $f_r$ no rewards are earned, but all the quantity s currently in the
system must be processed and will incur holding costs until it is processed.

It follows then from (3.7), (3.8), and (3.9) that

$$V^\pi \le M_1 + V^{f_r} = V. \qquad (3.10)$$

Since $\pi$ was arbitrary and $V^{f_r} \le 0$, we conclude that $V^* \le V \le M_1 < \infty$.
To complete the proof it suffices to show that Condition 3.3 holds, so that
Theorem 3.1 applies. To this end, first observe that $V^{f_r}$ satisfies the functional
equation

$$V^{f_r}(s) = -E\Big[\int_0^T e^{-\alpha\tau} h((s - N(\tau))^+)\, d\tau\Big] + E[e^{-\alpha T} V^{f_r}((s - N(T))^+)] \quad (s \in S). \qquad (3.11)$$

(This follows from (1.1) and (3.9).) Note also that (3.5) and (3.9) imply that

$$C(s) = -E[V^{f_r}(s + S)] + V^{f_r}(s). \qquad (3.12)$$

Now, using (3.10), (3.11), and (3.12), we can write

$$UV(s) - V(s) = \max_{a=0,1} \Big\{ ar - E\Big[\int_0^T e^{-\alpha\tau} h((s + aS - N(\tau))^+)\, d\tau\Big] + E[e^{-\alpha T} V((s + aS - N(T))^+)] \Big\} - V(s)$$
$$= \max_{a=0,1} \Big\{ ar - E\Big[\int_0^T e^{-\alpha\tau} h((s + aS - N(\tau))^+)\, d\tau\Big] + E[e^{-\alpha T} V^{f_r}((s + aS - N(T))^+)] \Big\} - V^{f_r}(s) - (1-\rho)M_1$$
$$= \max_{a=0,1} \{ ar + E[V^{f_r}(s + aS)] \} - V^{f_r}(s) - r$$
$$= \max \{ r - C(s), 0 \} - r. \qquad (3.13)$$

Since $C(\cdot)$ is non-negative, it follows from (3.13) that

$$-r \le UV(s) - V(s) \le 0 \qquad (3.14)$$

and hence $\|UV - V\| \le r < \infty$, so that Condition 3.3 holds. This completes
the proof of the theorem.
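
The closed form (3.13) can be checked numerically in a discrete special case. The
sketch below is an illustration under stated assumptions, not part of the original
argument: $S \equiv 1$, Poisson output with rate `mu`, exponential interarrival times
with rate `lam`, and states 0..K (the reject-all policy never leaves this range,
so no truncation error enters $V^{f_r}$). All function names and parameter values
are hypothetical.

```python
import numpy as np

def kernel_and_holding(K, lam, mu, alpha, h):
    """Discounted kernel P and expected discounted holding cost g until the next arrival."""
    q = mu / (lam + mu + alpha)
    p = lam / (lam + mu + alpha)
    P = np.zeros((K + 1, K + 1))
    g = np.zeros(K + 1)
    g[0] = h(0) / (lam + alpha)
    for k in range(K + 1):
        for d in range(k):
            P[k, k - d] = q**d * p
        P[k, 0] = q**k * lam / (lam + alpha)
        if k > 0:
            g[k] = h(k) / (lam + mu + alpha) + q * g[k - 1]
    return P, g

def gap_and_waiting_cost(K, r, lam, mu, alpha, h):
    """Return UV - V on states 0..K-1 and the waiting cost C of (3.12)."""
    P, g = kernel_and_holding(K, lam, mu, alpha, h)
    rho = lam / (lam + alpha)                          # rho = E[exp(-alpha*T)]
    V_fr = np.linalg.solve(np.eye(K + 1) - P, -g)      # reject-all value, as in (3.11)
    V = r / (1.0 - rho) + V_fr                         # reference function (3.7)
    PV = P @ V
    s = np.arange(K)
    reject = -g[s] + PV[s]
    accept = r - g[s + 1] + PV[s + 1]
    gap = np.maximum(accept, reject) - V[s]
    C = V_fr[s] - V_fr[s + 1]                          # (3.12) with S = 1
    return gap, C
```

Under these assumptions `gap` should agree with `np.maximum(r - C, 0) - r` up to
rounding, and therefore lie between `-r` and 0 as in (3.14).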

Remark 3.3. Since $UV \le V$ (by (3.13)), it follows by induction from the
monotonicity of U that $U^nV \le U^{n-1}V$, for all $n \ge 1$, so that the convergence of
$U^nV$ to V* is monotonically decreasing.

Corollary 3.3. Consider the arrival-control model with continuous charging of
holding cost. Suppose $V = V^{f_r} \le V^*$. Then $V^* \in W(V)$ and is the unique solution
in W(V) to the optimality equation (3.6). Moreover, for any $V_0 \in W(V)$,
$\|V^* - U^nV_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^nV_0$ converges to V* uniformly and
geometrically.

Proof. A direct consequence of Theorem 3.2, since (3.10) implies that
$\|V - V^{f_r}\| = M_1 < \infty$, where V is defined by (3.7).

Remark 3.4. Since $UV^{f_r} \ge L(f_r)V^{f_r} = V^{f_r}$, it follows by induction from the
monotonicity of U that $U^nV \ge U^{n-1}V$, for all $n \ge 1$, when $V = V^{f_r}$. Hence the
convergence of $U^nV$ to V* is monotonically increasing. In fact, of course,
convergence of $U^nV$ to V* is monotonically increasing whenever V is the infinite-
horizon value function for a particular policy. This observation is true in
general of "successive approximations in policy space" and goes back at least
to Bellman [1].
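
The monotone-increase property noted in Remark 3.4 holds for any stationary policy
used as a starting point, and can be observed on a small finite model. The sketch
below is illustrative only; the random model, the function names, and the choice of
policy are assumptions.

```python
import numpy as np

def make_mdp(n_states, n_actions, rho, seed=1):
    """Random finite model: r[a, s] and P[a, s, s'], each row of P summing to rho < 1."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_actions, n_states, n_states))
    P *= rho / P.sum(axis=2, keepdims=True)
    r = rng.normal(size=(n_actions, n_states))
    return r, P

def U(v, r, P):
    """(Uv)(s) = max_a { r(s,a) + sum_s' p(s,a;s') v(s') }."""
    return (r + P @ v).max(axis=0)

def policy_value(r, P, f):
    """Infinite-horizon value of the stationary policy f (f[s] is the action in state s)."""
    n = r.shape[1]
    idx = np.arange(n)
    r_f = r[f, idx]        # r(s, f(s))
    P_f = P[f, idx, :]     # row s is p(s, f(s); .)
    return np.linalg.solve(np.eye(n) - P_f, r_f)

def smallest_increment(r, P, f, n_steps):
    """Minimum over states and steps of U^n V - U^(n-1) V, starting from V = V^f."""
    v = policy_value(r, P, f)
    smallest = np.inf
    for _ in range(n_steps):
        new_v = U(v, r, P)
        smallest = min(smallest, np.min(new_v - v))
        v = new_v
    return smallest
```

Up to rounding, `smallest_increment` should never be negative, whichever policy
`f` is supplied.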

Remark 3.5. Define a new return function $\tilde r$ by

$$\tilde r(s,a) := r(s,a) + E[e^{-\alpha T} V^{f_r}((s + aS - N(T))^+)] - V^{f_r}(s) \quad (s \in S,\ a = 0,1)$$

or, equivalently, $\tilde r(f) := r(f) + P(f)V^{f_r} - V^{f_r}$, for each decision rule f.
In fact, the proof of Theorem 3.2 shows that

$$\tilde r(s,a) = a(r - C(s)) \quad (s \in S,\ a = 0,1)$$

so that the Markov decision model $(S,A,D,\tilde r,p)$ has the structure of the arrival-
control problem with lump-sum charging of waiting costs, which was studied in
Section 2. The effect of this shift transformation is to subtract $V^{f_r}$ from the
value function for each policy and hence from the optimal value functions for
both finite and infinite horizons. In economic terms, $-V^{f_r}(s)$ is just the
expected discounted cost of holding the quantity s until it is processed and hence
represents an unavoidable, or fixed, cost that must be incurred no matter what
policy is followed. The shift transformation thus can be interpreted as a removal
of these fixed costs, so that only those costs that vary with the policy (namely,
those associated with the current and future customers) are included in the
value functions. That such a transformation should lead to an equivalent decision
problem is therefore plausible on intuitive grounds.

This shift transformation was first proposed for an arrival-control problem
in Lippman and Stidham [16]. In that context, however, it was used for a
different purpose, namely, facilitating the comparison of an optimal policy with
the (equilibrium) policy followed when each customer acts to maximize his own
expected discounted net benefit, $a(r - C(s))$. (Note that an equilibrium policy
is myopic with respect to the transformed model $(S,A,D,\tilde r,p)$.)

An equilibrium policy can also be used to generate an alternative reference
function V for the model with continuous charging. Denote by $f_e$ the (stationary)
equilibrium policy; that is, $f_e$ accepts a customer ($f_e(s) = 1$) in state s iff
$r \ge C(s)$. Since the optimal value function, $\tilde V^*$, for the transformed model
$(S,A,D,\tilde r,p)$ satisfies (2.3), it follows that an optimal policy accepts a customer
iff $r \ge C(s) + E[e^{-\alpha T}(\tilde V^*((s - N(T))^+) - \tilde V^*((s + S - N(T))^+))]$. It can easily
be shown (see Johansen and Stidham [12]) that $\tilde V^*(\cdot)$ is non-increasing. Hence an
optimal policy accepts a customer in state s only if the equilibrium policy $f_e$
accepts in s (cf. also Lippman and Stidham [16] and the references cited therein
for other instances of this phenomenon). This property can be used to prove

Theorem 3.4. Consider the arrival-control model with continuous charging of
holding cost. Suppose $V = V^{f_e} \le V^*$. Then $V^* \in W(V)$ and is the unique solution
in W(V) to the optimality equation (3.6). Moreover, for any $V_0 \in W(V)$,
$\|V^* - U^nV_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^nV_0$ converges to V* uniformly
and geometrically.

Proof. We shall verify that $U^nV \to V^*$ and that Condition 3.3 holds. The remainder
of the proof then parallels that of Theorem 3.1. (See Remark 3.2.)

Let $f^*$ denote an optimal policy. Since $f^*$ accepts only if $f_e$ accepts, it
follows that for each $n \ge 1$, $X_n$ is stochastically smaller under $f^*$ than under
$f_e$, given $X_0 = s$. Thus, since $V^{f_e}(\cdot)$ is non-increasing, we have (cf. Lemma 2.3)

$$(P(f^*))^n V^{f_e} = E^{f^*}[V^{f_e}(X_n)] \ge E^{f_e}[V^{f_e}(X_n)] = (P(f_e))^n V^{f_e}. \qquad (3.15)$$

Moreover,

$$V^{f_e} = (L(f_e))^n V^{f_e} = (L(f_e))^n 0 + (P(f_e))^n V^{f_e}$$

and $(L(f_e))^n 0 \to V^{f_e}$, as $n \to \infty$, by (1.2), so that $(P(f_e))^n V^{f_e} \to 0$, as $n \to \infty$.
Using this fact together with (3.15), we conclude that

$$\lim_{n \to \infty} (P(f^*))^n V^{f_e} \ge \lim_{n \to \infty} (P(f_e))^n V^{f_e} = 0. \qquad (3.16)$$

Now it follows from (3.16) and Theorem 3.S of [32] that $U^nV^{f_e} \to V^*$.

To verify Condition 3.3, first observe that ($s \in S$)

$$UV^{f_e}(s) = \max_{a=0,1} \Big\{ ar - E\Big[\int_0^T e^{-\alpha\tau} h((s + aS - N(\tau))^+)\, d\tau\Big] + E[e^{-\alpha T} V^{f_e}((s + aS - N(T))^+)] \Big\}$$
$$= \max \{ r + E[W(s + S)],\ W(s) \}$$
$$= W(s) + (r - w(s))\, 1(w(s) \le r), \qquad (3.17)$$

where

$$W(s) := -E\Big[\int_0^T e^{-\alpha\tau} h((s - N(\tau))^+)\, d\tau\Big] + E[e^{-\alpha T} V^{f_e}((s - N(T))^+)],$$

$$w(s) := W(s) - E[W(s + S)].$$

On the other hand,

$$V^{f_e}(s) = L(f_e)V^{f_e}(s)$$
$$= (r + E[W(s + S)])\, 1(C(s) \le r) + W(s)\, 1(C(s) > r)$$
$$= W(s) + (r - w(s))\, 1(C(s) \le r). \qquad (3.18)$$

It follows from (3.17) and (3.18) that

$$0 \le UV^{f_e}(s) - V^{f_e}(s) = (r - w(s))\, 1(w(s) \le r < C(s)) + (w(s) - r)\, 1(C(s) \le r < w(s)). \qquad (3.19)$$

It remains to show that $UV^{f_e} - V^{f_e}$ is uniformly bounded above. To this
end, first observe that

$$V^{f_r}(s) \le V^{f_e}(s) \le M_1 + V^{f_r}(s). \qquad (3.20)$$

The right-hand inequality follows from (3.10). The left-hand inequality can be
verified as follows. First, note that

$$L(f_e)V^{f_r}(s) = 1(r \ge C(s))\,(r + E[V^{f_r}(s + S)]) + 1(r < C(s))\, V^{f_r}(s) = V^{f_r}(s) + 1(r \ge C(s))\,(r - C(s)) \ge V^{f_r}(s).$$

(Here we have used (3.12).) Hence, iterating and using the monotonicity of the
operator $L(f_e)$, we have ($n \ge 1$)

$$V^{f_r} \le (L(f_e))^n V^{f_r} = (L(f_e))^n 0 + (P(f_e))^n V^{f_r}.$$

Letting $n \to \infty$ and using the fact that $V^{f_r} \le 0$, we have

$$V^{f_r} \le V^{f_e} + \lim_{n \to \infty} (P(f_e))^n V^{f_r} \le V^{f_e}.$$

Now it follows from (3.11) and the definition of W(s) that

$$W(s) = V^{f_r}(s) + E[e^{-\alpha T}(V^{f_e}((s - N(T))^+) - V^{f_r}((s - N(T))^+))].$$

Hence, from (3.20) we obtain

$$V^{f_r}(s) \le W(s) \le V^{f_r}(s) + \rho M_1,$$

which implies (using (3.12)) that

$$-\rho M_1 + C(s) \le w(s) \le C(s) + \rho M_1.$$

Using these inequalities together with (3.19), we find that

$$0 \le UV^{f_e}(s) - V^{f_e}(s) \le (r - C(s) + \rho M_1)\, 1(w(s) \le r < C(s)) + (C(s) - r + \rho M_1)\, 1(C(s) \le r < w(s))$$
$$\le \rho M_1 [\, 1(w(s) \le r < C(s)) + 1(C(s) \le r < w(s)) \,]$$
$$\le \rho M_1 < \infty,$$

thus verifying Condition 3.3.

Remark 3.6. The proof of Theorem 3.4 could be simplified considerably if one
could assert that w(s) is non-decreasing and $w(s) \ge C(s)$. The first property
says that, if one is free to choose the action in the current period but must
follow $f_e$ thereafter, then the optimal action is non-increasing in s. The
second property says that, under the same circumstances, the optimal action will
be to reject (a = 0) whenever the equilibrium policy rejects ($f_e(s) = 0$), and
perhaps in other states as well. (Analogous properties hold when one considers
an optimal rather than equilibrium policy; see [12].) However, we have not
been able to find proofs for either property and conjecture that they do not
hold in general. Intuitively, if one must follow an equilibrium policy after
the current period, then it might be optimal to accept in the current period
even if an equilibrium policy would reject, since the resulting state at the next
arrival might then be large enough to force the next customer to balk rather than
join, which in turn makes it possible for the customer after that to join rather
than balk. The net increase in total welfare could be positive in some cases.

Example 2°. Control of the Service Rate

As a further illustration of how our theorems can be applied to problems
in queueing control, we consider an M/M/1 queue with variable service rate. (See
Stidham and Prabhu [33], Sobel [28], or Crabill, Gross, and Magazine [4] for
detailed discussion of this model and relevant references.) Let the arrival
rate $\lambda > 0$ be fixed. The service rate $\mu$ can be chosen from the interval $[0, \bar\mu]$.
Whenever service rate $\mu$ is in effect, the system incurs a cost $c(\mu)$ per unit
time, where $c(\cdot)$ is non-decreasing and continuous with c(0) = 0. A holding
cost is incurred at rate h(i) whenever i customers are in the system, $i \ge 0$,
where $h(\cdot)$ is non-negative and non-decreasing. Costs are discounted continuously
at rate $\alpha > 0$.

To construct a Markov decision model for this problem, we use the "new
device" of Lippman [15] and observe the system at points of arrival (occurring
at rate $\lambda$), service completion (occurring at rate $\mu$), and null events (occurring
at rate $\bar\mu - \mu$). The time between observation points (stages) has exponential
distribution with parameter $\Lambda := \lambda + \bar\mu$. The system is said to be in state i
(where i is a non-negative integer) whenever there are i customers present. The
action taken at an observation point is the service rate $\mu$, which remains in
effect until the next observation point. Thus, $D(i) = [0, \bar\mu]$, $i \ge 1$, and
$D(0) = \{0\}$. The one-stage return function $r(i,\mu)$ is given by

$$r(i,\mu) = (\alpha + \Lambda)^{-1} [-c(\mu) - h(i)]$$

and the transition probability measure is determined by the discrete (discounted)
transition probabilities,

$$p_{ij}(\mu) := \Pr\{X_{t+1} = j \mid X_t = i, A_t = \mu\} = (\alpha + \Lambda)^{-1} [\lambda\, 1(j = i+1) + \mu\, 1(j = i-1) + (\bar\mu - \mu)\, 1(j = i)].$$

Conditions 3.1 and 3.2 are satisfied, with $\rho = \Lambda(\Lambda + \alpha)^{-1}$ and M = 0. Define
V*(i) (i = 0,1,...) as the optimal value function for the infinite-horizon problem.
V* satisfies the optimality equations

$$V^*(i) = (\Lambda + \alpha)^{-1} \max_{0 \le \mu \le \bar\mu} \{ -c(\mu) - h(i) + \lambda V^*(i+1) + \mu V^*(i-1) + (\bar\mu - \mu)V^*(i) \}, \quad i \ge 1, \qquad (3.21)$$

$$V^*(0) = (\Lambda + \alpha)^{-1} \{ -h(0) + \lambda V^*(1) + \bar\mu V^*(0) \}.$$

Define the stationary policy g by $g(i) = \bar\mu$, i = 1,2,.... We call g the
full-service policy. Among all policies, g obviously minimizes the infinite-
horizon expected discounted holding cost. Now define the function $V: S \to R$ by

$$V(i) := -(\alpha + \Lambda)^{-1} E_i^g\Big[\sum_{t=0}^\infty h(X_t)\Big], \quad i \ge 0. \qquad (3.22)$$

Theorem 3.5. Consider the M/M/1 service-rate-control model, with V defined by
(3.22). Then $V^* \le V \le 0$, $V^* \in W(V)$ and is the unique solution in W(V) to the
optimality equations (3.21). Moreover, for any $V_0 \in W(V)$, $\|V^* - U^nV_0\| \le
\rho^n \|V^* - V_0\| < \infty$, so that $U^nV_0$ converges uniformly and geometrically to V*.

Proof. Let $\pi$ be an arbitrary policy and note that

$$V^\pi(i) = E_i^\pi\Big[\sum_{t=0}^\infty r(X_t, A_t)\Big] \qquad (3.23)$$
$$= -(\alpha + \Lambda)^{-1} E_i^\pi\Big[\sum_{t=0}^\infty c(A_t)\Big] - (\alpha + \Lambda)^{-1} E_i^\pi\Big[\sum_{t=0}^\infty h(X_t)\Big], \quad i \ge 0.$$

The first term on the right-hand side of (3.23) is non-positive and the second
term is maximized by the full-service policy g, since g minimizes the infinite-
horizon expected discounted holding costs. Therefore, $V^\pi \le V$. Since $\pi$ was
arbitrary we conclude that

$$V^* \le V \le 0.$$

To complete the proof, it suffices to show that Condition 3.3 holds, so that
Theorem 3.1 may be applied. To this end, first observe that V satisfies the
functional equations

$$V(i) = (\alpha + \Lambda)^{-1} \{ -h(i) + \lambda V(i+1) + \bar\mu V(i-1) \}, \quad i \ge 1,$$
$$V(0) = (\alpha + \Lambda)^{-1} \{ -h(0) + \lambda V(1) + \bar\mu V(0) \}.$$

Therefore, $UV(0) - V(0) = 0$ and, for $i \ge 1$,

$$UV(i) - V(i) = (\alpha + \Lambda)^{-1} \Big[ \max_{0 \le \mu \le \bar\mu} \{ -c(\mu) - h(i) + \lambda V(i+1) + \mu V(i-1) + (\bar\mu - \mu)V(i) \} - \{ -h(i) + \lambda V(i+1) + \bar\mu V(i-1) \} \Big]$$
$$\le (\alpha + \Lambda)^{-1} \max_{0 \le \mu \le \bar\mu} \{ (\bar\mu - \mu)(V(i) - V(i-1)) \}.$$

Since $V(i) \le V(i-1)$, for all $i \ge 1$, the quantity in brackets is maximized by
setting $\mu = \bar\mu$. Hence,

$$UV(i) - V(i) \le 0, \quad i \ge 1. \qquad (3.24)$$

On the other hand,

$$UV(i) - V(i) \ge L(g)V(i) - V(i) = -(\alpha + \Lambda)^{-1} c(\bar\mu), \quad i \ge 1.$$

Therefore, Condition 3.3 holds and the theorem is proved.
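
The two-sided bound on $UV - V$ obtained in this proof can be observed numerically.
The following sketch is an illustration under assumptions that are not in the
original: the queue is truncated at `N` customers (arrivals in state `N` are
blocked), the service rate is restricted to a finite grid containing `mu_bar`, and
the cost functions and parameter values are arbitrary choices.

```python
import numpy as np

def full_service_reference(N, lam, mu_bar, alpha, h):
    """V of (3.22): minus the discounted holding cost under the full-service policy g."""
    beta = 1.0 / (alpha + lam + mu_bar)          # beta = (alpha + Lambda)^(-1)
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        P[i, min(i + 1, N)] += beta * lam        # arrival (blocked in state N)
        P[i, max(i - 1, 0)] += beta * mu_bar     # service completion (null event in state 0)
    hv = np.array([h(i) for i in range(N + 1)], dtype=float)
    return np.linalg.solve(np.eye(N + 1) - P, -beta * hv)

def bellman_gap(V, lam, mu_bar, alpha, h, c, grid):
    """UV - V, with the maximization over the service rate restricted to `grid`."""
    N = len(V) - 1
    beta = 1.0 / (alpha + lam + mu_bar)
    gap = np.zeros(N + 1)
    for i in range(N + 1):
        up = V[min(i + 1, N)]
        if i == 0:                               # D(0) = {0}: no service decision
            UV = beta * (-h(0) + lam * up + mu_bar * V[0])
        else:
            UV = max(beta * (-c(m) - h(i) + lam * up + m * V[i - 1] + (mu_bar - m) * V[i])
                     for m in grid)
        gap[i] = UV - V[i]
    return gap
```

With non-decreasing `h` and `c`, every entry of `bellman_gap` should lie between
`-beta * c(mu_bar)` and 0, up to rounding, which is the content of Condition 3.3
for this reference function.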


Remark  3.7.  Since  UV  <.  V  (by  (3.24)),  it  follows  by  induction  using  the 
monotonicity  of  U  that  U11^  <  Un  1V,  for  all  n  >  1,  so  that  the  convergence  of 
u*v  to  V*  is  mono ton ically  decreasing. 


p 

Corollary  3.6.  Consider  the  M/M/1  service-rate-control  model.  Suppose  V=V  <  V*. 
Then  V*  e  W(V)  and  is  the  unique  solution  in  W(V)  to  the  optimality  equations 
(3.21).  Moreover,  for  any  VQ  e  W(V) ,  |j  V*  -  i/^VqH  <.  pn||  V*  -  V0||  <  ®  ,  so 
that  converges  uniformly  and  geometnically  to  V*. 


Proof.  It  follows  from  (3.22)  and  (3.23)  that 

V(i)  >  V8(i)  >  a'1  c(u)  +  V(i),  i  >  0. 

Thus  the  corollary  follows  immediately  from  Theorem  3.5. 

Remark  3.8.  Since  UV8  L(g)  V8  *  V8,  it  follows  by  induction  using  the 
monotonicity  of  U  that  u'V8  >  1)°  1V8,  for  all  n  >  1,  so  that  the  convergence  of 


lAr8  to  V*  is  monotonically  Increasing. 
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Remark  3.9.  Define  a  new  return  function  f  by  (i  ■  0,1,...,  0  <  p  <  ji  ) 

CD 

r(i.p):  =  r(i, p)  +  I  P^p)  V8(j)  -  V8(i), 

J“0 

or,  equivalently, 

r(f):  -  r(f)  +  P(f)  V8  -  VS 

for  each  decision  rule  f  e  F.  By  an  argument  similar  to  that  used  in  the  proof 
of  Theorem  3.5,  it  follows  that 

r(i,p)  =  (a+A)_1  [c (p)  -  c(p)  -  (p  -p)  (V8(i-1)  -  V8(i))], 

i  >  1,  pc  ( 0,u]  (3.25) 

r (0,0)  =  0. 


From  (3.25)  we  see  immediately  that 

C  <a  max  r(i.ii)  <  (crt-A)  *  c(p), 

o<u<Z 

so that the Markov decision model $(S, A, D, \tilde r, p)$ satisfies the conditions of Model I studied in Section 2. The effect of this shift transformation is to subtract $V^g$ from the value function for each policy and hence from the optimal value function for both finite and infinite horizons. Thus, the value functions for all policies are measured relative to that of an extremal (in this case, full-service) policy. Like the policy that rejects all customers in the arrival-control problem, this full-service policy minimizes holding costs among all policies. Thus, the holding costs under policy g (that is, $-\bar V(i)$ as defined by (3.22)) are unavoidable fixed costs that must be incurred no matter what policy we follow. (Other policies may incur larger holding costs, of course.) Our results say that the original service-rate-control model is equivalent to one in which these fixed costs have been eliminated, that is, a model with $\tilde r(i,\mu)$ as its one-stage return function.
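
The effect of the shift transformation can be illustrated on a small finite model. The following is only a sketch under stated assumptions: the three states, the rewards, the transition rows, and the names `policy_value` and `shifted_return` are hypothetical and do not come from the M/M/1 model, and the discount factor is kept as an explicit multiplier rather than folded into the transition probabilities.

```python
# Hypothetical 3-state example; rewards and transition rows are made up.
RHO = 0.9  # discount factor, kept separate from P here for readability


def policy_value(r, P, rho=RHO, iters=2000):
    """Solve v = r + rho * P v for a fixed decision rule by fixed-point iteration."""
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[i] + rho * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


def shifted_return(r, P, shift, rho=RHO):
    """Transformed return r~(f) = r(f) + rho * P(f) shift - shift."""
    n = len(r)
    return [r[i] + rho * sum(P[i][j] * shift[j] for j in range(n)) - shift[i]
            for i in range(n)]


# Reference ("extremal") rule g and some other rule f.
r_g = [-1.0, -3.0, -6.0]
P_g = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
r_f = [0.0, -2.0, -5.0]
P_f = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.0, 0.2, 0.8]]

V_g = policy_value(r_g, P_g)
V_f = policy_value(r_f, P_f)
rt_g = shifted_return(r_g, P_g, V_g)   # vanishes: g is the reference rule
rt_f = shifted_return(r_f, P_f, V_g)
Vt_f = policy_value(rt_f, P_f)         # value of f in the shifted model
```

In this sketch the value of rule f in the shifted model agrees with `V_f - V_g` componentwise, which is the sense in which value functions are measured relative to the reference policy.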


Examination of (3.25) reveals that $\tilde r(i,\mu)$ can be interpreted as the net savings from taking action $\mu$ rather than $\bar\mu$ in the current period, if policy g is to be followed in all future periods. If the maximum net savings is zero, that is, if

$$(\bar\mu - \mu)^{-1}[\,c(\bar\mu) - c(\mu)\,] \le V^g(i-1) - V^g(i), \quad \text{for all } i \ge 1,$$

then a full-service policy is optimal (cf. Sobel [29], Schleef [25], Sobel and Winston [30]). Thus, in addition to providing a convenient vehicle for demonstrating uniform, geometric convergence of successive approximations, the transformed model with $V^g$ as shift function is in some sense the "proper setting"
in  which  to  investigate  the  form  of  an  optimal  policy.  We  have  already  made 
a  similar  observation  in  the  context  of  the  arrival-control  model,  and  we  believe 
the  observation  to  be  valid  in  the  majority  of  control  models  for  stochastic 
service  systems  and  related  systems. 

Remark 3.10. A close examination of the proofs of Theorems 3.2 and 3.5 will reveal that in both cases the proof of Condition 3.3 for the reference function $\bar V$ hinges on the fact that the function

$$h(s,a) := L(f_a)\bar V(s) = r(s,a) + \int p(s,a,ds')\,\bar V(s')$$

is supermodular [36], [35], [33] in $(s,a)$, where $f_a$ is the stationary policy that always takes action a ($f_a(s) = a$, for all $s \in S$). This property, together with the facts that $\bar V(s)$ is nonincreasing in $s \in S$, S contains a minimal element 0, and $\bar V$ (in both examples) is related to the value function for an extremal policy, leads directly to uniform upper and lower bounds on $U\bar V(s) - \bar V(s)$ and thence to Condition 3.3. Supermodularity is extensively used as a device for proving monotonicity of optimal control policies [26], [23], [31], [12]. We expect that it could also be used as the basis for a general model of queueing-control problems for which methods like those used in Theorem 3.1 could be used to demonstrate uniform geometric convergence of successive approximations.

Example 3. Inventory Control.

This example will be used to give three variants of the classical single-product inventory problem with periodic review, as described, e.g., by Scarf [23]. We shall allow for unbounded returns and not require convexity conditions for the cost or reward structure. Let us first describe the general model.

An inventory system is observed at discrete points in time, say the beginning of each month. At these points in time the state of the system is defined as the available inventory. This inventory may be negative as well as positive, so backlogging is allowed. Let us represent the state space by $S = \mathbb{R} = (-\infty, \infty)$.

If the state $s \in S$ is observed at time $t \in \{0, 1, 2, \ldots\}$, then a nonnegative amount $a \in \mathbb{R}$ can be ordered. Delivery is immediate. The ordering cost c(a) is nondecreasing in $a \ge 0$. A holding cost h(s) is incurred at the beginning of a period and is nondecreasing as a function of the inventory $s \ge 0$. If the observed inventory s at the beginning of a period is negative, then a shortage cost p(-s) is incurred. We assume $p(\cdot)$ to be a nondecreasing function of the amount of the shortage. Finally, we assume that the demands in successive periods are i.i.d. random variables with probability distribution function $\Phi(\cdot)$, such that the expected demand in each period is finite, i.e.,

$$\int_0^\infty \xi\, d\Phi(\xi) =: M_1 < \infty. \tag{3.26}$$

Costs and rewards are discounted. The one-period discount factor is $\rho < 1$, so that costs incurred in period t are weighted with the factor $\rho^t$. As was the case with the queueing-control examples, the goal is to maximize the total expected discounted returns (equivalently: minimize the total expected discounted costs) over an infinite horizon, and to find a (stationary) policy for which this maximum is (approximately) achieved.


In the notation of our basic decision model, the one-stage return function is given by

$$r(s,a) = \begin{cases} -h(s) - c(a), & \text{for } s \ge 0,\ a \ge 0 \\ -p(-s) - c(a), & \text{for } s < 0,\ a \ge 0 \end{cases} \tag{3.27}$$

and the (discounted) transition probabilities by

$$p(s,a;E) = \rho \int_0^\infty 1_E(s + a - \xi)\, d\Phi(\xi). \tag{3.28}$$

In the remainder of this section we shall show that Model II is applicable under certain conditions on c, h, p, and $\Phi$. This involves verification of Conditions 3.1 - 3.3 for a reference function $\bar V$ that is chosen appropriately. Conditions 3.1 and 3.2 are satisfied trivially since we are considering a discounted problem with costs only, so all returns are negative and hence $r^+ = 0$ (cf. (3.27) and (3.28)). Thus the only problem remaining is to determine a $\bar V$ such that Condition 3.3 is satisfied. This will be done first for the classical case with linear holding and shortage costs. Then a model with non-linear costs will be analyzed, under mild conditions which do not require convexity of the cost functions. Moreover, jumps in these functions are permitted as long as they are limited in magnitude. As a consequence of these more general conditions, of course, results such as the optimality of an (s,S) policy do not necessarily hold.
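
As a concrete reading of the one-stage return function and of one application of the operator U, the following sketch may help. It assumes a finite action set and a discrete demand distribution purely for illustration; the numerical cost parameters and the names `one_stage_return` and `bellman_backup` are hypothetical and not part of the model as stated.

```python
def one_stage_return(s, a, h, p, c):
    """One-stage return r(s, a): h, p, c are callables for the holding,
    shortage, and ordering cost (all costs enter with a negative sign)."""
    carrying = h(s) if s >= 0 else p(-s)
    return -carrying - c(a)


def bellman_backup(v, s, actions, demand, rho, h, p, c):
    """One application of U at state s, for a finite action set and a
    discrete demand distribution given as {demand value: probability}."""
    return max(
        one_stage_return(s, a, h, p, c)
        + rho * sum(q * v(s + a - xi) for xi, q in demand.items())
        for a in actions
    )


# Illustrative (assumed) linear costs with a set-up cost K.
K, unit_cost, h_rate, p_rate, rho = 2.0, 1.0, 0.5, 3.0, 0.9
h = lambda s: h_rate * s
p = lambda x: p_rate * x
c = lambda a: (K if a > 0 else 0.0) + unit_cost * a
demand = {0: 0.25, 1: 0.5, 2: 0.25}  # hypothetical demand distribution

zero = lambda s: 0.0
first_iterate = bellman_backup(zero, -2, [0, 1, 2, 3], demand, rho, h, p, c)
```

Starting from the zero function, the first iterate at a state simply picks the cheapest immediate action, which here is not to order.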

(a) The classical inventory model.

In this case the model as described, e.g., by Scarf [23] will be studied. The cost of ordering an amount $a \ge 0$ is given by

$$c(a) = \delta(a) K + c \cdot a,$$

where $K \ge 0$, $c \ge 0$, and $\delta(a) = 1$ if $a > 0$, $\delta(a) = 0$ if $a = 0$. The holding cost is linear.


$$V^* = U^n V^* \le U^n b \le 0, \quad n \ge 1. \tag{3.34}$$

Define

$$h_n := h(1 + \rho + \cdots + \rho^n), \quad n \ge 0,$$

$$p_n := p(1 + \rho + \cdots + \rho^n), \quad n \ge 0, \tag{3.35}$$

$$M'_n := h \cdot M_1 \cdot \rho\,(1 + 2\rho + \cdots + n\rho^{n-1}), \quad n \ge 1$$

($M'_0 := 0$). The induction hypothesis is the following.

$$U^n b(s) \le \begin{cases} -h_n \cdot s + M'_n, & s \ge 0 \\ \max\{p_n \cdot s,\ (p+c)\cdot s\}, & s < 0 \end{cases} \tag{3.36}$$

Noting that (3.31) implies that $p + c \le p + \rho p_{n-1} = p_n$ for sufficiently large n, we see that the desired result will follow from (3.34) and (3.36), upon letting $n \to \infty$.
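
The constants in (3.35) are easy to tabulate numerically. The sketch below is illustrative only: the function name `coefficients` and the parameter values are assumptions, with `M1` standing for the mean demand per period.

```python
def coefficients(n, h, p, rho, M1):
    """Return (h_n, p_n, M'_n) following the definitions in (3.35)."""
    geom = sum(rho ** j for j in range(n + 1))            # 1 + rho + ... + rho^n
    weighted = sum((j + 1) * rho ** j for j in range(n))  # 1 + 2 rho + ... + n rho^(n-1)
    return h * geom, p * geom, h * M1 * rho * weighted


h, p, rho, M1 = 0.5, 3.0, 0.9, 2.0  # illustrative values only

h_big, p_big, M_big = coefficients(3000, h, p, rho, M1)
h_limit = h / (1 - rho)                  # limit of h_n
M_limit = h * M1 * rho / (1 - rho) ** 2  # limit of M'_n
```

With these definitions the constants obey the one-step recursion $M'_{n+1} = \rho h_n M_1 + \rho M'_n$, which is the form in which they arise in the inductive step, and they increase to the limits $h(1-\rho)^{-1}$ and $h M_1 \rho (1-\rho)^{-2}$.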


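Clearly (3.36) is true for n = 0. Suppose that it is true for some $n \ge 0$.

Case 1 - $s \ge 0$:

$$\begin{aligned}
U^{n+1}b(s) &= -h\cdot s + \sup_{a \ge 0}\Big\{-K\cdot\delta(a) - c\,a + \rho\int_0^\infty U^n b(s+a-\xi)\,d\Phi(\xi)\Big\} \\
&\le -h\cdot s + \sup_{a \ge 0}\Big\{\rho\int_0^{s+a}\big[-h_n\cdot(s+a-\xi) + M'_n\big]\,d\Phi(\xi)\Big\} \\
&\le -h\cdot s + \sup_{a \ge 0}\Big\{-\rho\, h_n\int_0^{s+a}(s+a-\xi)\,d\Phi(\xi)\Big\} + \rho M'_n \\
&= -h\cdot s - \rho\, h_n\int_0^{s}(s-\xi)\,d\Phi(\xi) + \rho M'_n \\
&= -h\cdot s - \rho\, h_n\Big[s\,\Phi(s) - \int_0^s \xi\,d\Phi(\xi)\Big] + \rho M'_n \\
&= -(h + \rho\, h_n)\,s + \rho\, h_n\Big[s\,(1-\Phi(s)) + \int_0^s \xi\,d\Phi(\xi)\Big] + \rho M'_n \\
&\le -h_{n+1}\cdot s + \rho\, h_n\, M_1 + \rho M'_n \\
&= -h_{n+1}\cdot s + M'_{n+1}.
\end{aligned}$$
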
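
The step in Case 1 that bounds the bracketed demand term by the mean demand rests on the identity $s(1-\Phi(s)) + \int_0^s \xi\,d\Phi(\xi) = E[\min(\xi, s)]$. A small numerical check for a discrete demand distribution is sketched below; the distribution and the function names are hypothetical.

```python
# Hypothetical discrete demand distribution: {demand value: probability}.
demand = {0: 0.2, 1: 0.3, 2: 0.3, 5: 0.2}


def mean_demand(dist):
    """Mean demand per period (the constant M_1)."""
    return sum(x * q for x, q in dist.items())


def truncated_mean(dist, s):
    """s * (1 - Phi(s)) + integral over [0, s] of xi dPhi(xi)."""
    cdf_s = sum(q for x, q in dist.items() if x <= s)
    partial = sum(x * q for x, q in dist.items() if x <= s)
    return s * (1 - cdf_s) + partial


def expected_min(dist, s):
    """E[min(demand, s)], computed directly."""
    return sum(min(x, s) * q for x, q in dist.items())


M1 = mean_demand(demand)
levels = [0, 0.5, 1, 3, 10]
```

For every inventory level s in `levels` the two expressions coincide and never exceed `M1`, which is the only property of the demand distribution used in that step.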
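Case 2 - $s < 0$:

Define $k := \min\{n \mid c \le \rho p_n\}$. (Recall that (3.31) implies that $k < \infty$.) First suppose $n < k$, so that $c > \rho p_n$.

$$\begin{aligned}
U^{n+1}b(s) &= p\cdot s + \max\Big\{\rho\int_0^\infty U^n b(s-\xi)\,d\Phi(\xi),\ c\cdot s - K + \sup_{y>s}\Big\{-c\cdot y + \rho\int_0^\infty U^n b(y-\xi)\,d\Phi(\xi)\Big\}\Big\} \\
&\le p\cdot s + \max\Big\{\rho\int_0^\infty p_n\cdot(s-\xi)\,d\Phi(\xi), \\
&\qquad c\cdot s - K + \sup_{s<y\le 0}\Big\{-c\cdot y + \rho\int_0^\infty p_n\cdot(y-\xi)\,d\Phi(\xi)\Big\}, \\
&\qquad c\cdot s - K + \sup_{y>0}\Big\{-c\cdot y + \rho\int_y^\infty p_n\cdot(y-\xi)\,d\Phi(\xi)\Big\}\Big\} \\
&= p\cdot s + \max\Big\{\rho\, p_n\, s - \rho\, p_n M_1,\ c\cdot s - K + \sup_{s<y\le 0}\{(-c+\rho\, p_n)\,y\} - \rho\, p_n M_1,\ c\cdot s - K - \rho\, p_n M_1\Big\} \\
&= p\cdot s - \rho\, p_n M_1 + \max\{\rho\, p_n\, s,\ c\cdot s - K - c\cdot s + \rho\, p_n\, s,\ c\cdot s - K\} \\
&\le (p + \rho\, p_n)\,s = p_{n+1}\cdot s.
\end{aligned}$$

Now suppose n = k.

$$\begin{aligned}
U^{k+1}b(s) &\le p\cdot s + \max\Big\{\rho\int_0^\infty p_k\cdot(s-\xi)\,d\Phi(\xi), \\
&\qquad c\cdot s - K + \sup_{s<y\le 0}\Big\{-c\cdot y + \rho\int_0^\infty p_k\cdot(y-\xi)\,d\Phi(\xi)\Big\}, \\
&\qquad c\cdot s - K + \sup_{y>0}\Big\{-c\cdot y + \rho\int_y^\infty p_k\cdot(y-\xi)\,d\Phi(\xi)\Big\}\Big\} \\
&\le p\cdot s + \max\Big\{\rho\, p_k\, s - \rho\, p_k M_1,\ c\cdot s - K + \sup_{s<y\le 0}\{(-c+\rho\, p_k)\,y\} - \rho\, p_k M_1,\ c\cdot s - K\Big\} \\
&\le p\cdot s + \max\{\rho\, p_k\, s,\ c\cdot s\} \\
&= (p+c)\cdot s.
\end{aligned}$$

Finally, suppose n > k.

$$\begin{aligned}
U^{n+1}b(s) &\le p\cdot s + \max\Big\{\rho\int_0^\infty (p+c)\cdot(s-\xi)\,d\Phi(\xi), \\
&\qquad c\cdot s - K + \sup_{s<y\le 0}\Big\{-c\cdot y + \rho\int_0^\infty (p+c)\cdot(y-\xi)\,d\Phi(\xi)\Big\}, \\
&\qquad c\cdot s - K + \sup_{y>0}\Big\{-c\cdot y + \rho\int_y^\infty (p+c)\cdot(y-\xi)\,d\Phi(\xi)\Big\}\Big\} \\
&\le p\cdot s + \max\Big\{\rho(p+c)\,s - \rho(p+c)M_1,\ c\cdot s - K + \sup_{s<y\le 0}\{(-c+\rho(p+c))\,y\} - \rho(p+c)M_1,\ c\cdot s - K\Big\} \\
&\le p\cdot s + \max\{\rho(p+c)\,s,\ c\cdot s\} \\
&= (p+c)\cdot s.
\end{aligned}$$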

This  completes  the  induction  and  so  the  lemma  is  proved. 
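
The index k used in Case 2 marks the first stage at which the accumulated shortage cost $\rho p_n$ is at least the unit ordering cost c. A minimal sketch for computing it follows; the function name, the iteration cap `max_n`, and the parameter values are assumptions made for illustration.

```python
def first_order_index(c, p, rho, max_n=1000):
    """Smallest n with c <= rho * p_n, where p_n = p (1 + rho + ... + rho^n).

    Returns None if no such n is found up to max_n, which is what happens
    when c exceeds the limit rho * p / (1 - rho) of rho * p_n.
    """
    p_n = p  # p_0
    for n in range(max_n + 1):
        if c <= rho * p_n:
            return n
        p_n = p + rho * p_n  # p_{n+1} = p + rho * p_n
    return None


k = first_order_index(c=3.0, p=1.0, rho=0.9)           # rho * p / (1 - rho) = 9 > c
k_missing = first_order_index(c=10.0, p=1.0, rho=0.9)  # c above the limit 9
```

For the first set of values the iterates $\rho p_n$ are 0.9, 1.71, 2.439, 3.0951, so k = 3; for the second, no finite index exists within the cap.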

Our  main  result  for  this  model  is  contained  in  the  following  theorem. 

Theorem 3.8. Consider the inventory-control model with linear costs and with $\bar V$ defined by (3.32). Then $V^* \le \bar V \le 0$, $V^* \in W(\bar V)$ and is the unique solution in $W(\bar V)$ to the optimality equation, v = Uv. Moreover, for any $V_0 \in W(\bar V)$, $\|V^* - U^n V_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^n V_0$ converges to $V^*$ uniformly and geometrically.

Proof. In order to apply Theorem 3.1, it remains to verify that Condition 3.3 holds: $\|U\bar V - \bar V\| < \infty$. In fact, we show that $0 \ge U\bar V(s) - \bar V(s) \ge -M$, where $M < \infty$, for all $s \in S$.

Case 1 - $s \ge 0$:

Using essentially the same argument as used to verify the induction hypothesis (3.36) in Lemma 3.7, it can be shown that

$$U\bar V(s) \le -h\cdot s\,(1-\rho)^{-1} + h\cdot M_1\cdot\rho\,(1-\rho)^{-2}$$

and hence $U\bar V(s) \le \bar V(s)$. The only difference is that $h_n$ and $M'_n$ are replaced by their respective limits, $h(1-\rho)^{-1}$ and $h\cdot M_1\cdot\rho(1-\rho)^{-2}$. On the other hand, $U\bar V \ge r(f_0) + P(f_0)\bar V$, with $f_0$ the policy that has $f_0(s) = 0$ for $s \ge 0$, and $f_0(s) = -s$ for $s < 0$. Hence

$$\begin{aligned}
U\bar V(s) &\ge -h\cdot s + \rho\int_0^s\big[-h(1-\rho)^{-1}(s-\xi)\big]\,d\Phi(\xi) + \rho\int_s^\infty (p+c)(s-\xi)\,d\Phi(\xi) \\
&= -h\cdot s - h\rho(1-\rho)^{-1}\Big[s\,\Phi(s) - \int_0^s \xi\,d\Phi(\xi)\Big] + \rho(p+c)\Big\{s\,(1-\Phi(s)) - \int_s^\infty \xi\,d\Phi(\xi)\Big\} \\
&= -h\cdot s - h\rho(1-\rho)^{-1}s + \big(h\rho(1-\rho)^{-1} + \rho(p+c)\big)\Big\{s\,(1-\Phi(s)) + \int_0^s \xi\,d\Phi(\xi)\Big\} - \rho(p+c)M_1 \\
&\ge -h(1-\rho)^{-1}s - \rho(p+c)M_1 \\
&\ge \bar V(s) - \rho M_1\big(h(1-\rho)^{-2} + p + c\big) \\
&\ge \bar V(s) - M,
\end{aligned}$$

for sufficiently large $M < \infty$.

Case 2 - $s < 0$:

Again, the proof that $U\bar V(s) \le \bar V(s) = (p+c)\cdot s$ is essentially the same as the proof of (3.36) in Lemma 3.7. On the other hand, $U\bar V \ge r(f_0) + P(f_0)\bar V$, so that

$$\begin{aligned}
U\bar V(s) &\ge p\cdot s - K - c(-s) + \rho\int_0^\infty (p+c)(0-\xi)\,d\Phi(\xi) \\
&= (p+c)\cdot s - K - \rho(p+c)M_1 \\
&\ge (p+c)\cdot s - M,
\end{aligned}$$

for sufficiently large $M < \infty$.

Remark 3.11. As in the queueing examples we can define for each decision rule f a transformed return function $\tilde r(f)$ by

$$\tilde r(f) := r(f) + P(f)\bar V - \bar V = L(f)\bar V - \bar V.$$

Now the transformed inventory control model $(S, A, D, \tilde r, p)$ satisfies the conditions of Model I of Section 2, since $\|\sup_f \tilde r(f)\| = \|U\bar V - \bar V\| \le M$. Moreover, $\tilde V^* = V^* - \bar V$.

Remark 3.12. The proof of Theorem 3.8 shows that $U\bar V \le \bar V$. It follows by induction from the monotonicity of U that $U^n \bar V \le U^{n-1}\bar V$, so that the convergence of $U^n \bar V$ to $V^*$ is monotonically decreasing.

Remark 3.13. The idea of extremal policies as introduced for queueing systems cannot be used directly for inventory systems. It is clear from the proof of Theorem 3.8, however, that any policy $f_0$ of the form $f_0(s) = 0$ (do not order) for all $s \ge s^+ \ge 0$ and $-s - k \le f_0(s) \le -s$ for $s < 0$ and some $0 \le k < \infty$ satisfies $\|V^* - V^{f_0}\| < \infty$, which implies that $f_0$ might serve as a reference policy. For such $f_0$ we might define $\tilde r(f)$ by

$$\tilde r(f) := r(f) + P(f)V^{f_0} - V^{f_0},$$

which automatically implies that $\tilde r(f_0) = 0$. Consequently, the successive approximations method,

$$v_0 = 0, \qquad v_n = \sup_f \{\tilde r(f) + P(f)\,v_{n-1}\}, \quad n \ge 1,$$

converges to $\tilde V^*$ monotonically, i.e., $v_n \ge v_{n-1}$ and $v_n \to \tilde V^* = V^* - V^{f_0}$.
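
The monotone behaviour of successive approximations in a model shifted by the value function of a reference policy can be observed on a toy example. The sketch below uses a hypothetical 3-state, 2-action model (all numbers invented) with an explicit discount factor; it is not the inventory model itself, only an illustration of the mechanism.

```python
RHO = 0.9
S, A = range(3), range(2)
# r[s][a] and P[s][a][j]: hypothetical returns and transition probabilities.
r = [[-1.0, -2.0], [-3.0, -2.5], [-6.0, -4.0]]
P = [[[0.5, 0.5, 0.0], [0.9, 0.1, 0.0]],
     [[0.5, 0.0, 0.5], [0.8, 0.2, 0.0]],
     [[0.0, 0.5, 0.5], [0.7, 0.3, 0.0]]]


def backup(ret, v, s, a):
    """Return of action a in state s plus discounted expected continuation value."""
    return ret[s][a] + RHO * sum(P[s][a][j] * v[j] for j in S)


def value_iteration(ret, steps):
    """Successive approximations from v_0 = 0; returns all iterates."""
    v = [0.0] * len(S)
    history = [v]
    for _ in range(steps):
        v = [max(backup(ret, v, s, a) for a in A) for s in S]
        history.append(v)
    return history


f0 = [0, 0, 0]            # reference policy: always take action 0
v_ref = [0.0] * len(S)    # its value function, by fixed-point iteration
for _ in range(2000):
    v_ref = [backup(r, v_ref, s, f0[s]) for s in S]

# Shifted returns: r~(s, a) = r(s, a) + RHO * sum_j P(s, a, j) v_ref(j) - v_ref(s).
r_shift = [[backup(r, v_ref, s, a) - v_ref[s] for a in A] for s in S]
hist = value_iteration(r_shift, 400)
v_star = value_iteration(r, 2000)[-1]
```

Because the shifted return of the reference policy is zero, the first iterate is nonnegative, and the iterates in `hist` then increase toward `v_star - v_ref`.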


As in the queueing examples the effect of the shift transformation is to subtract $V^{f_0}$ from the value function for each policy. So again the value functions are measured relative to that of a reference policy, in this case $f_0$. Roughly speaking, our results say that the inventory problem with linear cost structure is equivalent to one in which the costs one has to incur to reach the "feasible area" (not too far from state s = 0) are subtracted.


(b) Inventory control with restricted order quantity

We consider the same problem as described in Example 3(a) with the restriction that the maximal order quantity is R. In this model it is not always possible from states s < 0 to reach the "feasible area" in one step. Hence a logical reference policy will be of the form: $f_0(s) = R$ for $s < s_0 \le 0$; $f_0(s) = 0$ for $s \ge s^+ \ge 0$; $0 \le f_0(s) \le R$ for $s \in [s_0, s^+)$. The value function for such a policy will be of the order:

$$V'(s) := \begin{cases} -h\cdot s\,(1-\rho)^{-1}, & \text{for } s \ge 0 \\ p\cdot s\,(1-\rho)^{-1}, & \text{for } s < 0. \end{cases} \tag{3.37}$$


Define $\bar V$ by

$$\bar V(s) = \begin{cases} \min\big(-h\cdot s\,(1-\rho)^{-1} + h\cdot M_1\cdot\rho\,(1-\rho)^{-2},\ 0\big), & \text{for } s \ge 0 \\ \min\big(p\cdot s\,(1-\rho)^{-1} + p\cdot R\cdot\rho\,(1-\rho)^{-2},\ 0\big), & \text{for } s < 0. \end{cases} \tag{3.38}$$

Lemma 3.9. $V^* \le \bar V \le 0$.

Proof. The proof parallels that of Lemma 3.7. With b again defined by (3.33), the induction hypothesis is now

$$U^n b(s) \le \begin{cases} -h_n\cdot s + M'_n, & s \ge 0 \\ p_n\cdot s + M''_n, & s < 0 \end{cases} \tag{3.39}$$

with $h_n$, $p_n$, and $M'_n$ defined by (3.35) and

$$M''_n := p\cdot R\cdot\rho\,(1 + 2\rho + \cdots + n\rho^{n-1}).$$

If (3.39) is true, the desired result again follows from (3.34) upon letting $n \to \infty$. It remains to prove (3.39).
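
For orientation, the reference function of (3.38) and the constants $M''_n$ can be evaluated directly. The sketch below assumes the reading of (3.38) in which both branches are capped at zero; the function names and parameter values are hypothetical.

```python
def v_bar(s, h, p, rho, M1, R):
    """Reference function for the restricted-order model, capped at zero."""
    if s >= 0:
        return min(-h * s / (1 - rho) + h * M1 * rho / (1 - rho) ** 2, 0.0)
    return min(p * s / (1 - rho) + p * R * rho / (1 - rho) ** 2, 0.0)


def m_double_prime(n, p, rho, R):
    """M''_n = p * R * rho * (1 + 2 rho + ... + n rho^(n-1))."""
    return p * R * rho * sum((j + 1) * rho ** j for j in range(n))


h, p, rho, M1, R = 0.5, 3.0, 0.9, 2.0, 4.0  # illustrative values only
limit = p * R * rho / (1 - rho) ** 2        # limit of M''_n as n grows
```

Near s = 0 the cap at zero is active, while far from zero the function is linear in s with slope $-h(1-\rho)^{-1}$ on the right and $p(1-\rho)^{-1}$ on the left.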

Case 1 - $s \ge 0$:

Since $\sup_{0 \le a \le R}\{\cdot\} \le \sup_{a \ge 0}\{\cdot\}$, the inductive argument used in Case 1 of Lemma 3.7 can be used again here.

Case 2 - $s < 0$:

Clearly (3.39) is true for n = 0.

Again define $k = \min\{n \mid c \le \rho p_n\}$. For $n \le k$, the inductive argument used in Case 2 of Lemma 3.7 can be used again here to show that

$$U^n b(s) \le p_n\cdot s \le p_n\cdot s + M''_n.$$

Suppose (3.39) is true for some $n \ge k$.

$$\begin{aligned}
(s < -R)\quad U^{n+1}b(s) &\le p\cdot s + \max\Big\{\rho\int_0^\infty p_n\cdot(s-\xi)\,d\Phi(\xi) + \rho M''_n, \\
&\qquad c\cdot s - K + \sup_{s<y\le s+R}\Big\{-c\cdot y + \rho\int_0^\infty p_n\cdot(y-\xi)\,d\Phi(\xi)\Big\} + \rho M''_n\Big\} \\
&= p\cdot s + \max\Big\{\rho\, p_n\, s,\ c\cdot s - K + \sup_{s<y\le s+R}\{(-c+\rho\, p_n)\,y\}\Big\} - \rho\, p_n M_1 + \rho M''_n \\
&\le p\cdot s + \max\{\rho\, p_n\, s,\ c\cdot s - K + (-c+\rho\, p_n)(s+R)\} + \rho M''_n
\end{aligned}$$



Theorem 3.10. Consider the inventory-control model with linear costs, restricted order quantity, $0 \le a \le R < \infty$, and $\bar V$ defined by (3.38). Then $V^* \le \bar V \le 0$, $V^* \in W(\bar V)$ and is the unique solution in $W(\bar V)$ to the optimality equation, v = Uv. Moreover, for any $V_0 \in W(\bar V)$, $\|V^* - U^n V_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^n V_0$ converges to $V^*$ uniformly and geometrically.

Proof. Again we verify Condition 3.3 by showing that $0 \ge U\bar V(s) - \bar V(s) \ge -M$, where $M < \infty$, for all $s \in S$.

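Case 1 - $s \ge 0$:

The proof that $U\bar V(s) \le \bar V(s)$ is the same as in Theorem 3.8. On the other hand, $U\bar V \ge r(f_0) + P(f_0)\bar V$, with $f_0$ the policy that has $f_0(s) = 0$ for $s \ge -R$, and $f_0(s) = R$ for $s < -R$. Hence

$$\begin{aligned}
U\bar V(s) &\ge -h\cdot s + \rho\int_0^s\big[-h(1-\rho)^{-1}(s-\xi)\big]\,d\Phi(\xi) + \rho\int_s^\infty p(1-\rho)^{-1}(s-\xi)\,d\Phi(\xi) \\
&= -h\cdot s - h\rho(1-\rho)^{-1}\Big[s\,\Phi(s) - \int_0^s \xi\,d\Phi(\xi)\Big] + \rho\, p(1-\rho)^{-1}\Big[s\,(1-\Phi(s)) - \int_s^\infty \xi\,d\Phi(\xi)\Big] \\
&\ge -h(1-\rho)^{-1}s - \rho\, p(1-\rho)^{-1}M_1 \\
&\ge \bar V(s) - \rho(1-\rho)^{-1}M_1\big[h(1-\rho)^{-1} + p\big] \\
&\ge \bar V(s) - M,
\end{aligned}$$

for sufficiently large $M < \infty$.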

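Case 2 - $s < 0$:

The proof that $U\bar V(s) \le \bar V(s)$ is essentially the same as in the inductive step of the proof of Lemma 3.9 for $n \ge k$. The only difference is that $p_n$ and $M''_n$ are replaced by their respective limits, $p(1-\rho)^{-1}$ and $p\cdot R\cdot\rho(1-\rho)^{-2}$. On the other hand, $U\bar V \ge r(f_0) + P(f_0)\bar V$, so that

$$\begin{aligned}
(s < -R)\quad U\bar V(s) &\ge p\cdot s - K - c\cdot R + \rho\int_0^\infty p(1-\rho)^{-1}(s + R - \xi)\,d\Phi(\xi) \\
&= p\cdot s + \rho\, p(1-\rho)^{-1}s + \rho\, R\, p(1-\rho)^{-1} - K - c\cdot R - \rho\, p(1-\rho)^{-1}M_1 \\
&\ge p(1-\rho)^{-1}s + p\cdot R\cdot\rho(1-\rho)^{-2} - \big[K + c\cdot R + \rho\, p(1-\rho)^{-1}\big(M_1 + R(1-\rho)^{-1}\big)\big] \\
&\ge \bar V(s) - M,
\end{aligned}$$

for sufficiently large $M < \infty$.

$$\begin{aligned}
(-R \le s < 0)\quad U\bar V(s) &\ge p\cdot s + \rho\int_0^\infty p(1-\rho)^{-1}(s-\xi)\,d\Phi(\xi) \\
&= p\cdot s + \rho\, p(1-\rho)^{-1}s - \rho\, p(1-\rho)^{-1}M_1 \\
&= p(1-\rho)^{-1}s + p\cdot R\cdot\rho(1-\rho)^{-2} - p\cdot R\cdot\rho(1-\rho)^{-2} - \rho\, p(1-\rho)^{-1}M_1 \\
&\ge \bar V(s) - \rho\, p(1-\rho)^{-1}\big[M_1 + R(1-\rho)^{-1}\big] \\
&\ge \bar V(s) - M,
\end{aligned}$$

for sufficiently large $M < \infty$.

This completes the proof of Theorem 3.10.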

Remark 3.14. The economic regularity condition (3.31), $c < \rho(1-\rho)^{-1}p$, is not needed in the model with restricted order quantity. In fact, if $c \ge \rho(1-\rho)^{-1}p$, then $k = \infty$ and the proof of Lemma 3.9 shows that $U^n b(s) \le p_n\cdot s$, for all $s < 0$ and $n \ge 0$. Hence, in this case, $\bar V$ can be defined as $\bar V(s) := p(1-\rho)^{-1}s$, for $s < 0$.

Remark 3.15. It should be clear that it is not difficult in general to find a good reference function $\bar V$ or W. Usually the structure of the problem will indicate the direction in which to search for such a function. This was true for the queueing examples as well as for the inventory-control models.

Remark 3.16. Once again the convergence of $U^n \bar V$ to $V^*$ is monotonically decreasing, since $U\bar V \le \bar V$ for $\bar V$ defined by (3.38).

Remark 3.17. By defining $\tilde r(f) := r(f) + P(f)\bar V - \bar V$, the inventory-control problem with restricted order quantity can be transformed into a problem that satisfies the conditions for Model I.

(c) Inventory control with non-linear costs

Now we shall treat the inventory control model as described in the introduction of Example 3, without the restriction to linear costs. We shall need the following condition:

$$c(-s) \le \rho\cdot p(-s) + M_2, \quad s \le 0, \tag{3.40}$$

where $M_2 < \infty$. The economic interpretation of this condition is that below some point $s_0 \le 0$ it cannot be much worse to order up to zero than to stay for one period with the shortage $s \le s_0$. Note that for the case of linear costs, this condition is stronger than (3.31), but still economically plausible. In addition we assume that ordering all at once cannot be much worse than ordering separately, i.e.,

$$c(a+b) \le c(a) + c(b) + M_3, \quad a, b \ge 0, \tag{3.41}$$
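
Conditions (3.40) and (3.41) can be probed numerically for a candidate pair of cost functions by estimating the smallest admissible constants on a grid. The sketch below is only a heuristic check, not a proof; the cost functions, the grids, and the names `smallest_m2` and `smallest_m3` are hypothetical.

```python
import math


def smallest_m2(c, p, rho, shortages):
    """Grid estimate of the smallest M_2 with c(x) <= rho * p(x) + M_2."""
    return max(c(x) - rho * p(x) for x in shortages)


def smallest_m3(c, amounts):
    """Grid estimate of the smallest M_3 with c(a + b) <= c(a) + c(b) + M_3."""
    return max(c(a + b) - c(a) - c(b) for a in amounts for b in amounts)


def concave_cost(a):
    """Set-up cost 2 plus a concave variable cost (non-convex overall)."""
    return 0.0 if a == 0 else 2.0 + 3.0 * math.sqrt(a)


def linear_shortage(x):
    """Shortage cost of one unit per unit short."""
    return x


rho = 0.9
grid = [0.01 * i for i in range(5001)]  # shortages 0, 0.01, ..., 50
coarse = [0.5 * i for i in range(21)]   # order sizes 0, 0.5, ..., 10
m2 = smallest_m2(concave_cost, linear_shortage, rho, grid)
m3 = smallest_m3(concave_cost, coarse)
steep = smallest_m2(lambda a: 2.0 * a, linear_shortage, rho, grid)
```

For this concave ordering cost the estimate of $M_2$ stays near 4.5 however far the grid extends and $M_3$ is zero, whereas for the linear ordering cost with unit cost 2 the gap grows linearly with the grid, so that no finite $M_2$ exists.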
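
$$(s \ge 0)\quad \bar V(s) = \sup_\pi E_s^\pi\Big[\sum_{t=0}^\infty d(X_t)\Big] \ge \sup_\pi E_s^\pi\Big[\sum_{t=0}^\infty r(X_t, A_t)\Big] = V^*(s),$$

since $r(s,a) \le d(s)$, for all $s \in S$, $a \in A$;

$$\begin{aligned}
(s < 0)\quad V^*(s) &= UV^*(s) \\
&= -p(-s) + \sup_{a \ge 0}\Big\{-c(a) + \rho\int_0^\infty V^*(s+a-\xi)\,d\Phi(\xi)\Big\} \\
&= -p(-s) + \max\Big\{\sup_{0 \le a < -s}\Big\{-c(a) + \rho\int_0^\infty V^*(s+a-\xi)\,d\Phi(\xi)\Big\}, \\
&\qquad\qquad \sup_{a \ge -s}\Big\{-c(a) + \rho\int_0^\infty V^*(s+a-\xi)\,d\Phi(\xi)\Big\}\Big\} \\
&\le -p(-s) + \max\Big\{\sup_{0 \le a < -s}\Big\{-c(a) - \rho\int_0^\infty p(\xi - s - a)\,d\Phi(\xi)\Big\},\ -c(-s)\Big\} \\
&\le -p(-s) - c(-s) + \max\Big\{\sup_{0 \le a < -s}\{c(-s) - c(a) - \rho\, p(-s-a)\},\ 0\Big\} \\
&\le -p(-s) - c(-s) + \max\Big\{\sup_{0 \le a < -s}\{c(-s-a) - \rho\, p(-s-a)\} + M_3,\ 0\Big\} \\
&\le -p(-s) - c(-s) + M_2 + M_3.
\end{aligned}$$

This completes the proof of the lemma.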

Theorem 3.12. Consider the inventory-control model with non-linear costs satisfying (3.40), (3.41), and (3.42), and $\bar V$ defined by (3.43). Then $V^* \le \bar V \le 0$, $V^* \in W(\bar V)$ and is the unique solution in $W(\bar V)$ to the optimality equation, v = Uv. Moreover, for any $V_0 \in W(\bar V)$, $\|V^* - U^n V_0\| \le \rho^n \|V^* - V_0\| < \infty$, so that $U^n V_0$ converges to $V^*$ uniformly and geometrically.

Proof: Let $f_0$ be the policy that has $f_0(s) = 0$ for $s \ge 0$, and $f_0(s) = -s$ for $s < 0$. We shall show that $\|\bar V - V^{f_0}\| < \infty$. This will imply that $\|\bar V - V^*\| < \infty$, $\|UV^{f_0} - V^{f_0}\| < \infty$, and $\|V^* - V^{f_0}\| < \infty$.

It suffices to show that $V^{f_0}(s) - \bar V(s) \ge -M$, for all $s \in S$, where $M < \infty$. This will be done inductively. First observe that


V(s)  =  sup  if  [?  d(X  )] 

TT  t  =  0 


=  <C  l?  dCX  )] 

s  t=0 

n-1 

=  lim  E*c  [  I  d(X  )] 

Tr*-»  t=0 


Vf=fs)  *  -»(-s)  -  c  *  p/V«(-0^m 

n+l  "j 

-  -pf-sl  -  c(-s)  +  p /  V  -  Pf' 

c  n 

=  -  c(-'0  -  p/°°/’p(C)  +c (£) O  -  f*’ 

c 

~  -p(-s)  -  <:(■-•■•)  -  p/' -  P^’->+ 

=  -p(-s)  -  c(-s)  -  p[(l*p)f’4  +  !,2  +  :,1 
=  -p(-s)  -  c(-s)  -  M 

-  Vi(s)  - 

for sufficiently large $M < \infty$.

This  completes  the  proof  of  the  theorem. 

Remark 3.18. The inventory control model with non-linear costs and restricted order quantity can be handled in a similar way, provided $p(-(s+R)) - p(-s) \ge -M$ uniformly in $s < -R$.